Abstract

We consider a discrete time stochastic queueing system where a controller makes a 2-stage decision every slot. The
decision at the first stage reveals a hidden source of randomness with a control-dependent (but unknown) probability 
distribution. The decision at the second stage incurs a penalty vector that depends on this revealed randomness. 
The goal is to stabilize all queues and minimize a convex function of the time average penalty vector subject to an 
additional set of time average penalty constraints. This setting fits a wide class of stochastic optimization problems. 
This includes problems of opportunistic scheduling in wireless networks, where a 2-stage decision about channel 
measurement and packet transmission must be made every slot without knowledge of the underlying transmission 
success probabilities. We develop a simple max-weight algorithm that learns efficient behavior by averaging functionals 
of previous outcomes. The algorithm yields performance that can be pushed arbitrarily close to optimal, with a tradeoff 
in convergence time and delay. 

Index Terms 

Opportunistic scheduling, stochastic optimization, dynamic control, queueing analysis 

I. Introduction 

We consider a stochastic queueing system that operates in discrete time with unit timeslots $t \in \{0, 1, 2, \ldots\}$.
Every slot $t$, a controller makes a 2-stage control decision that affects queue dynamics and incurs a random penalty
vector. Specifically, the controller first chooses an action $k(t)$ from a finite set of $K$ "stage-1" control actions,
given by an action set $\mathcal{K} = \{1, \ldots, K\}$. After the action $k(t) \in \mathcal{K}$ is chosen, a random vector $\omega(t)$ is revealed,
which represents a collection of system parameters for slot $t$ (such as channel states for a wireless system). The
random vectors $\omega(t)$ are conditionally i.i.d. with distribution function $F_k(\omega)$ over all slots for which $k(t) = k$,
where $F_k(\omega)$ is defined:

$$F_k(\omega) \triangleq \Pr[\omega(t) \leq \omega \mid k(t) = k] \quad \text{for } k \in \mathcal{K} \qquad (1)$$

where vector inequality is taken entrywise. However, the distribution functions $F_k(\omega)$ are unknown. Based on
knowledge of the revealed $\omega(t)$ vector, the controller makes an additional decision $I(t)$, where $I(t)$ is chosen from
some abstract (possibly infinite) set $\mathcal{I}$. This decision affects the service rates and arrival processes of the queues
on slot $t$, and additionally incurs an $M$-dimensional penalty vector $x(t) = (x_1(t), \ldots, x_M(t))$, where each entry
$m \in \{1, \ldots, M\}$ is a function of $I(t)$, $k(t)$, and $\omega(t)$ according to known functions $\hat{x}_m(k(t), \omega(t), I(t))$:

$$x_m(t) = \hat{x}_m(k(t), \omega(t), I(t)) \quad \text{for } m \in \{1, \ldots, M\} \qquad (2)$$

The penalties can be either positive, zero, or negative (negative penalties can be used to represent rewards). Let
$\bar{x}$ be the time average penalty vector that results from the control actions made over time (assuming temporarily
that this time average is well defined). The goal is to develop a control policy that minimizes a convex function
$f(\bar{x})$ of the time average penalty vector, subject to queue stability and to an additional set of $N$ linear constraints
of the type $h_n(\bar{x}) \leq b_n$ for $n \in \{1, \ldots, N\}$, where the constants $b_n$ are given and the functions $h_n(x)$ are linear
over $x \in \mathbb{R}^M$.^1 This objective is similar to the objectives treated in [1] [2] [3] for stochastic network optimization
problems, and the problem can be addressed using the techniques given there in the following special cases:

• (Special Case 1) There is no "stage-1" control action $k(t)$, so that the revealed randomness $\omega(t)$ does not
depend on any control decision.

• (Special Case 2) The distribution functions $F_k(\omega)$ are known.

An example of Special Case 1 is the problem of minimizing time average power expenditure in a multi-user 
wireless downlink (or uplink) with random time-varying channel states that are known at the beginning of every 
slot. Simple max-weight transmission policies are known to solve such problems, even without knowledge of the 
probability distributions for the channels or packet arrivals [4]. An example of Special Case 2 is the same system 
with the additional assumption that there is a cost to measuring channels at the beginning of each slot. In this 
example, we have the option of either measuring the channels (and thus having the hidden random channel states 
revealed to us) or transmitting blindly. Such a problem is treated in [5], and a related problem with partial channel 
measurement is treated in [6]. Both [5] and [6] solve the problem via max-weight algorithms that include an 
expectation with respect to the known joint channel state distribution. While it is reasonable to estimate the joint 
channel state distribution when channels are independent and/or when the number of channels M is small (and the 
number of possible states per channel is also small), such estimation becomes intractable in cases when channels 
are correlated and there are, say, 1024 possible states per channel (and hence there are $1024^M$ probabilities to be
estimated in the joint channel state distribution). 

^1 For simplicity we treat the case of linear $h_n(x)$ functions here, although the analysis can be extended to treat convex (possibly non-linear)
$h_n(x)$ functions, as considered in [1] for the case without "stage-1" control decisions. See also Remark 1 in Section II-D for a further discussion.
Another important example is that of dynamic packet routing and transmission scheduling in a multi-commodity,
multi-hop network with probabilistic channel errors and multi-receiver diversity. The Diversity Backpressure Routing
(DIVBAR) algorithm of [7] reduces this problem to a 2-stage max-weight problem where each node decides which
of the $K$ commodities to transmit at the first stage. After transmission, the random vector of neighbor successes
is revealed, and the "stage-2" packet forwarding decision is made. If there is a single commodity ($K = 1$), the
problem of maximizing throughput reduces to a problem without "stage-1" decisions, while if there is more than
one commodity the solution given in [7] requires knowledge of the joint transmission success probabilities for all
neighboring nodes. It is of considerable interest to design a modified algorithm that does not require such probability
information.

In this paper, we provide a framework for solving such problems without having a-priori knowledge of the
underlying probability distributions. For simplicity, we focus primarily on 1-hop networks, although the techniques
extend to multi-hop networks using the techniques of [1] [8]. Our approach uses the observation that, rather than
requiring an estimate of the full probability distributions, all that is needed is an estimate of a set of expected max-weight
functionals that depend on these distributions. These can be efficiently estimated using penalties incurred
on previous transmissions to learn optimal behavior.

Related stochastic network optimization problems (without the 2-stage decision and learning component) appear in 
[9] [1] [3] [2]. Work in [9] considers optimization of a utility function of time average throughput in an opportunistic 
scheduling scenario but without queues or stability constraints. Work in [1] [3] treats joint queue stability and 
performance optimization using Lyapunov optimization, and work in [2] treats similar problems in a fluid limit
sense using primal-dual methods. Sequential channel probing techniques via dynamic programming are treated 
in [10] [11] [12]. General methods for Q-learning, based on approximate dynamic programming, are presented
in [13]. Our approach is different and is based on simpler Lyapunov optimization techniques, which, due to the 
special structure of the problem, provide strong (polynomial) bounds on convergence even for high dimensional 
state spaces. Simple methods of pursuit learning and reinforcement learning, which try to converge to the repeated 
selection of an optimal single index that provides a maximum mean reward (without a-priori knowledge of the 
average rewards for each index), are considered in [14] and applied to wireless rate selection in [15]. Our stage-1
decision options can be viewed as a finite set of indices, and hence our problem is related to [14] [15]. However, our 
2-stage problem structure and the underlying stochastic queues, convex cost optimization, and multi-dimensional 
inequality constraints make our problem much more complex. Further, the optimal policy may (and typically does)
result in a probabilistic mixture of many different action modes, rather than a single fixed action.

II. The Max Weight Learning Problem

Consider a collection of $L$ discrete time queues $Q(t) = (Q_1(t), \ldots, Q_L(t))$ with dynamic equation:

$$Q_l(t + 1) = \max[Q_l(t) - \mu_l(t), 0] + A_l(t) \qquad (3)$$

where $A_l(t)$ is the amount of new arrivals to queue $l$ on slot $t$, and $\mu_l(t)$ is the queue $l$ server rate on slot $t$. These
quantities are possibly affected by the two-stage control decision at slot $t$. Specifically, let $\mathcal{K} \triangleq \{1, \ldots, K\}$ represent
the set of stage-1 decision options, and let $k(t)$ represent the stage-1 decision made by the controller at time $t$, for
$t \in \{0, 1, 2, \ldots\}$. Recall that the corresponding random vector $\omega(t)$ that is revealed is conditionally i.i.d. over all
slots for which $k(t) = k$, with distribution function $F_k(\omega)$ given by (1). The $F_k(\omega)$ distributions are unknown to
the controller. Let $\mathcal{I}$ be the (possibly infinite) set of stage-2 control actions, and let $I(t) \in \mathcal{I}$ denote the stage-2
control action at time $t$.
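The queue dynamics (3) can be illustrated with a minimal Python sketch. The function names `queue_update` and `simulate_queue` and the numeric inputs are assumptions chosen for illustration only; they do not come from the paper.

```python
def queue_update(q, mu, a):
    """One step of (3): serve up to mu units of backlog, then add the new arrivals a."""
    return max(q - mu, 0.0) + a


def simulate_queue(q0, services, arrivals):
    """Apply (3) slot by slot and return the backlog trajectory, starting at q0."""
    path = [q0]
    for mu, a in zip(services, arrivals):
        path.append(queue_update(path[-1], mu, a))
    return path


# Hypothetical service and arrival sequences for three slots.
trajectory = simulate_queue(0.0, [1.0, 1.0, 2.0], [2.0, 0.0, 0.5])
print(trajectory)  # [0.0, 2.0, 1.0, 0.5]
```

Note that service offered to an empty queue is wasted: the max with zero is applied before arrivals are added, so arrivals on slot $t$ cannot be served until slot $t+1$.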

The arrival and service vectors $A(t) = (A_1(t), \ldots, A_L(t))$ and $\mu(t) = (\mu_1(t), \ldots, \mu_L(t))$ are determined by
$k(t)$, $\omega(t)$, $I(t)$ according to (known) functions $\hat{a}_l(k(t), \omega(t), I(t))$ and $\hat{\mu}_l(k(t), \omega(t), I(t))$:^2

$$A_l(t) = \hat{a}_l(k(t), \omega(t), I(t))$$
$$\mu_l(t) = \hat{\mu}_l(k(t), \omega(t), I(t))$$

Likewise, the penalty vector $x(t) = (x_1(t), \ldots, x_M(t))$ is determined by the (known) penalty functions $x_m(t) =
\hat{x}_m(k(t), \omega(t), I(t))$ for each $m \in \{1, \ldots, M\}$. The penalties are (possibly negative) real numbers, and we assume
that the penalty functions are bounded by finite constants $x_m^{min}$ and $x_m^{max}$ for all $m \in \{1, \ldots, M\}$, so that:

$$x_m^{min} \leq x_m(t) \leq x_m^{max} \quad \text{for all } t$$

Likewise, the queue arrivals and service rates are bounded as follows:

$$0 \leq A_l(t) \leq A_l^{max} \quad \text{for all } t$$
$$0 \leq \mu_l(t) \leq \mu_l^{max} \quad \text{for all } t$$

^2 The analysis is the same if the $a_l(\cdot)$, $\mu_l(\cdot)$, $x_m(\cdot)$ outcomes are random but i.i.d. given $k(t)$, $\omega(t)$, $I(t)$, with known means $\hat{a}_l(\cdot)$, $\hat{\mu}_l(\cdot)$, $\hat{x}_m(\cdot)$
that are used in the decision making part of the algorithm.

Aside from this boundedness, the functions $\hat{a}_l(\cdot)$, $\hat{\mu}_l(\cdot)$, and $\hat{x}_m(\cdot)$ are otherwise arbitrary (possibly nonlinear,
non-convex, and discontinuous). Define the time average penalty $\bar{x}(t)$, averaged over the first $t$ slots, as follows:

$$\bar{x}(t) \triangleq \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}\{x(\tau)\}$$

Let $f(x)$ be a convex and continuous function over $x \in \mathbb{R}^M$ (possibly negative, non-monotonic, and
non-differentiable). Let $h_n(x)$ for $n \in \{1, \ldots, N\}$ be a collection of linear functions over $x \in \mathbb{R}^M$. Note that since the
$x(t)$ penalties are bounded, the values of $f(\bar{x}(t))$ and $h_n(\bar{x}(t))$ are also bounded. The goal is to design a control
policy that makes 2-stage decisions over time so as to solve the following problem:^3

Minimize: $\limsup_{t \to \infty} f(\bar{x}(t))$ (4)

Subject to: $\limsup_{t \to \infty} h_n(\bar{x}(t)) \leq b_n$ for $n \in \{1, \ldots, N\}$ (5)

Stability of all queues $Q_1(t), \ldots, Q_L(t)$ (6)

In cases when the time average penalty vector converges to some value $\bar{x}$, the lim sup is equal to the regular limit
and the above problem can be more simply stated as minimizing $f(\bar{x})$ subject to $h_n(\bar{x}) \leq b_n$ for all $n \in \{1, \ldots, N\}$
and to stability of all queues. The following notion of queue stability is used:

Definition 1: A discrete time queue $Q(t)$ is strongly stable if:

$$\limsup_{t \to \infty} \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}\{|Q(\tau)|\} < \infty$$

We shall use the term stability throughout to refer to strong stability. The definition above uses the absolute value
of queue size because we shall soon introduce additional virtual queues that can take negative values.

A. Auxiliary Variables for Nonlinear Cost Functions 

It is useful to write the cost function $f(x)$ as a sum of linear (or affine) and non-linear components. Specifically,
define $\tilde{\mathcal{M}}$ as the set of all indices $m \in \{1, \ldots, M\}$ for which there are penalty variables $x_m(t)$ that participate in
a non-linear component of $f(x)$. Then we can write $f(x)$ as follows:

$$f(x) = l(x) + \tilde{f}(\tilde{x})$$

where $l(x)$ is a linear (or affine) function, $\tilde{x} = (x_m)|_{m \in \tilde{\mathcal{M}}}$ is a "sub-vector" of $x$ that contains only entries $x_m$
for $m \in \tilde{\mathcal{M}}$, and $\tilde{f}(\tilde{x})$ is a convex (and typically non-linear) function. Such a decomposition is always possible,
and in principle we can choose the trivial decomposition $\tilde{\mathcal{M}} = \{1, \ldots, M\}$, $l(x) = 0$, $\tilde{x} = x$, which does not
attempt to exploit linearity even if it exists in the cost function. However, it is useful to separate out the linear
components, because we shall require one auxiliary variable $\gamma_m(t)$ for each penalty $x_m(t)$ that participates in a
non-linear component of a cost function, while no such auxiliary variable is required for penalties that do not
participate in any non-linear components.^4

^3 While we assume the objective function $f(x)$ is a general convex (possibly non-linear) function, for simplicity we assume the cost functions
$h_n(x)$ are linear (see Remark 1 in Section II-D for extensions to non-linear $h_n(x)$ functions). Example linear constraints for a wireless system
are average power constraints at each node, where $h_n(x)$ is a linear function that sums the relevant components of the penalty vector $x(t)$
that correspond to instantaneous power expenditure at node $n$, and $b_n$ represents the average power constraint of node $n$. A typical non-linear
objective for networks is the maximization of a concave utility function $g(x)$ of the time average throughput, where $g(x)$ selects only those
entries $x_m$ that correspond to throughput, and $f(x) = -g(x)$.

^4 While it is possible to always define one auxiliary variable per penalty, exploiting linearity and reducing the number of auxiliary variables
can be more direct and may lead to faster convergence times.

For each $m \in \tilde{\mathcal{M}}$, let $\gamma_m(t)$ be a new variable that can be chosen as desired on each timeslot $t$, subject only to
the constraint that:

$$x_m^{min} - \sigma \leq \gamma_m(t) \leq x_m^{max} + \sigma \quad \text{for all } m \in \tilde{\mathcal{M}} \qquad (7)$$

for some positive value $\sigma > 0$ (to be chosen later). Let $\gamma(t) = (\gamma_m(t))|_{m \in \tilde{\mathcal{M}}}$ be a vector of $\gamma_m(t)$ components
for $m \in \tilde{\mathcal{M}}$. Define the time average $\bar{\gamma}(t)$ as follows:

$$\bar{\gamma}(t) \triangleq \frac{1}{t} \sum_{\tau=0}^{t-1} \mathbb{E}\{\gamma(\tau)\}$$

Then it is not difficult to show that the problem (4)-(6) is equivalent to the following:

Minimize: $\limsup_{t \to \infty} \left[ l(\bar{x}(t)) + \tilde{f}(\bar{\gamma}(t)) \right]$ (8)

Subject to: $\limsup_{t \to \infty} h_n(\bar{x}(t)) \leq b_n$ for $n \in \{1, \ldots, N\}$ (9)

$\lim_{t \to \infty} [\bar{x}_m(t) - \bar{\gamma}_m(t)] = 0$ for $m \in \tilde{\mathcal{M}}$ (10)

Stability of all queues $Q_1(t), \ldots, Q_L(t)$ (11)

Indeed, the equality constraint (10) indicates that the auxiliary variable $\gamma_m(t)$ can be used as a proxy for $x_m(t)$ for
all $m \in \tilde{\mathcal{M}}$, so that the above problem is equivalent to (4)-(6). This is useful for stochastic optimization because
$\gamma_m(t)$ can be chosen deterministically as any real number that satisfies (7), whereas the penalty $x_m(t)$ has random
outcomes. These auxiliary variables are similar to those introduced in [3] [1] for optimizing a convex and non-linear
function of a time average penalty in a stochastic network, which is a more general (and more complex) problem
than that of optimizing a time average of a non-linear penalty function. In the special case when the objective
function $f(x)$ is itself linear (so that $\tilde{f}(\tilde{x}) = 0$ and $f(x) = l(x)$), no auxiliary variables are needed, the set
$\tilde{\mathcal{M}}$ is empty, and the constraints (10) are irrelevant.

B. Virtual Queues for Time Average Inequalities and Equalities 

To satisfy the time average inequality constraints in (9), we define one virtual queue $U_n(t)$ for each $n \in
\{1, \ldots, N\}$, with dynamic queueing equation:

$$U_n(t + 1) = \max[U_n(t) + h_n(x(t)) - b_n, 0] \qquad (12)$$

This can be viewed as a discrete time queueing system with a constant "service rate" $b_n$ and with arrivals $h_n(x(t))$,
although we note in this case that the "arrivals" and/or the "service rate" can potentially be negative on a given
slot $t$. The intuition is that stabilizing this virtual queue ensures that the time average "arrival rate" is less than
or equal to $b_n$. This is similar to the virtual queues used for average power constraints in [4] and average penalty
constraints in [1].

To satisfy the time average equality constraints in (10), we introduce a generalized virtual queue $Z_m(t)$ for each
$m \in \tilde{\mathcal{M}}$, with dynamic equation:

$$Z_m(t + 1) = Z_m(t) - \gamma_m(t) + x_m(t) \qquad (13)$$

This has a different structure because it enforces an equality constraint, and it can be either positive or negative.
The following lemma shows that stabilizing the queues $U_n(t)$ and $Z_m(t)$ ensures that the corresponding inequality
and equality constraints are satisfied.
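The two virtual queue updates can be sketched in a few lines of Python. This is only an illustration of (12) and (13) on deterministic sample values; the helper names and the numeric inputs are hypothetical.

```python
def update_u(u, h_value, b):
    """Virtual queue (12): nonnegative backlog that grows when h_n(x(t)) exceeds b_n."""
    return max(u + h_value - b, 0.0)


def update_z(z, gamma, x):
    """Generalized virtual queue (13): accumulates x_m(t) - gamma_m(t), may go negative."""
    return z - gamma + x


def run_virtual_queues(h_values, b, gammas, xs):
    """Run both updates from empty initial queues and return the final (U, Z) pair."""
    u, z = 0.0, 0.0
    for h_value, gamma, x in zip(h_values, gammas, xs):
        u = update_u(u, h_value, b)
        z = update_z(z, gamma, x)
    return u, z


# Hypothetical three-slot sample path with b_n = 2.
u_final, z_final = run_virtual_queues([3.0, 1.0, 1.0], 2.0, [2.0, 2.0, 2.0], [1.0, 2.0, 3.0])
print(u_final, z_final)  # 0.0 0.0
```

Because (13) has no max operation, $Z_m(t) - Z_m(0)$ equals the running sum of $x_m(\tau) - \gamma_m(\tau)$, so $Z_m(t)/t \to 0$ forces the two time averages to agree. In the sample path above both averages equal 2, and the final value of $Z_m$ is zero.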
Lemma 1: (Queue Stability Lemma) If the queues $U_n(t)$ and $Z_m(t)$ satisfy the following (for all $n \in \{1, \ldots, N\}$
and $m \in \tilde{\mathcal{M}}$):

$$\lim_{t \to \infty} \frac{\mathbb{E}\{U_n(t)\}}{t} = 0, \qquad \lim_{t \to \infty} \frac{\mathbb{E}\{|Z_m(t)|\}}{t} = 0 \qquad (14)$$

then all inequality constraints (9) and equality constraints (10) are satisfied. Further, the condition (14) holds whenever the queues are
strongly stable.

Proof: Omitted for brevity (see [4] for a related proof). □

C. Lyapunov Functions 

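Define $\Theta(t) \triangleq [Q(t); U(t); Z(t)]$ as the vector of all actual and virtual queue backlogs. To stabilize the queues,
we define the following Lyapunov function:

$$L(\Theta(t)) \triangleq \frac{1}{2} \sum_{l=1}^{L} Q_l(t)^2 + \frac{1}{2} \sum_{n=1}^{N} U_n(t)^2 + \frac{1}{2} \sum_{m \in \tilde{\mathcal{M}}} Z_m(t)^2$$

Note that this Lyapunov function grows large when the absolute value of queue size is large, and hence keeping this
function small also maintains stable queues. Define the one-step conditional Lyapunov drift as follows:^5

$$\Delta(\Theta(t)) \triangleq \mathbb{E}\{L(\Theta(t + 1)) - L(\Theta(t)) \mid \Theta(t)\} \qquad (15)$$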

Let $V$ be a non-negative parameter used to control the proximity of our algorithm to the optimal solution of
(8)-(11). Using the framework of [1], we consider a control policy that observes the queue backlogs $\Theta(t)$ and takes
control actions on each slot $t$ that minimize a bound on the following "drift plus penalty" expression:

$$\Delta(\Theta(t)) + \mathbb{E}\{V l(x(t)) + V \tilde{f}(\gamma(t)) \mid \Theta(t)\}$$

^5 Strictly speaking, notation should be $\Delta(\Theta(t), t)$, as the drift may be non-stationary. However, we use the simpler notation $\Delta(\Theta(t))$ as a
formal representation of the right hand side of (15).

Computing the Lyapunov drift $\Delta(\Theta(t))$ by squaring the queueing update equations (12), (13), (3) and taking
conditional expectations leads to the following lemma.

Lemma 2: (The $RHS(\cdot)$ Bound) For a general control policy we have:

$$\Delta(\Theta(t)) + \mathbb{E}\{V l(x(t)) + V \tilde{f}(\gamma(t)) \mid \Theta(t)\} \leq B + \mathbb{E}\{V l(x(t)) + V \tilde{f}(\gamma(t)) \mid \Theta(t)\}$$
$$- \sum_{n=1}^{N} U_n(t) \mathbb{E}\{b_n - h_n(x(t)) \mid \Theta(t)\}$$
$$- \sum_{m \in \tilde{\mathcal{M}}} Z_m(t) \mathbb{E}\{\gamma_m(t) - x_m(t) \mid \Theta(t)\}$$
$$- \sum_{l=1}^{L} Q_l(t) \mathbb{E}\{\mu_l(t) - A_l(t) \mid \Theta(t)\} \qquad (16)$$

where $B$ is a finite constant that satisfies the following for all $t$ and all possible control actions that can be taken
on slot $t$:

$$B \geq \sum_{n=1}^{N} \mathbb{E}\{(b_n - h_n(x(t)))^2 \mid \Theta(t)\} + \sum_{m \in \tilde{\mathcal{M}}} \mathbb{E}\{(\gamma_m(t) - x_m(t))^2 \mid \Theta(t)\} + \sum_{l=1}^{L} \mathbb{E}\{(\mu_l(t) - A_l(t))^2 \mid \Theta(t)\}$$

Such a constant $B$ exists because of the boundedness assumptions of the penalty and cost functions, and an explicit
bound can be determined by considering the maximum squared values attained by the penalties and costs.

Proof: The proof is a straightforward drift computation (see, for example, [1]), and is omitted for brevity. □ 
The next section analyzes the performance of policies that choose control actions every slot to (approximately) 
minimize the right hand side of the drift expression (16). 
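As a rough illustration of what is being minimized, the sketch below evaluates the right hand side of the drift bound (16) for one candidate action, given the current queue backlogs and (estimates of) the conditional expectations. The function name `drift_plus_penalty_rhs` and its argument names are hypothetical; how the expectations are obtained is exactly the difficulty the paper addresses and is not modeled here.

```python
def drift_plus_penalty_rhs(B, V, l_val, f_val, U, slack, Z, gamma_minus_x, Q, mu_minus_a):
    """Evaluate B + V*(l + f) minus the queue-weighted terms of the drift bound.

    l_val, f_val  : expected linear cost l(x(t)) and auxiliary cost f~(gamma(t))
    slack         : expected b_n - h_n(x(t)) for each n
    gamma_minus_x : expected gamma_m(t) - x_m(t) for each m with an auxiliary variable
    mu_minus_a    : expected mu_l(t) - A_l(t) for each l
    """
    penalty = V * (l_val + f_val)
    u_term = sum(u * s for u, s in zip(U, slack))
    z_term = sum(z * d for z, d in zip(Z, gamma_minus_x))
    q_term = sum(q * d for q, d in zip(Q, mu_minus_a))
    return B + penalty - u_term - z_term - q_term


# Hypothetical values for a system with one queue of each type.
value = drift_plus_penalty_rhs(
    B=1.0, V=2.0, l_val=0.5, f_val=0.25,
    U=[1.0], slack=[0.5],
    Z=[-2.0], gamma_minus_x=[1.0],
    Q=[3.0], mu_minus_a=[1.0],
)
print(value)  # 1.0
```

A larger backlog $Q_l(t)$ puts more weight on the service-minus-arrival term, which is the usual max-weight effect; a negative $Z_m(t)$ makes a large $\gamma_m(t)$ costly, pushing the auxiliary variable back toward the penalty average.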

D. The Performance Theorem 

Define $f^*$ as the optimal solution for the problem (4)-(6) (i.e., it is the infimum cost over all policies that satisfy
the constraints). Define a value $\theta$ such that $0 \leq \theta \leq 1$, and consider the class of restricted policies that have random
exploration events independently with probability $\theta$ every slot. If a given slot $t$ is an exploration event, the stage-1
decision $k(t)$ is chosen independently and uniformly over $\{1, \ldots, K\}$ (regardless of the state of the system at this
time). We say that the slot is an exploration event of type $k$ if the exploration event leads to the random choice
of option $k$. Hence, exploration events of type $k$ occur independently with probability $\theta/K$ every slot. We note
that the stage-2 decision $I(t)$ and the auxiliary variables $\gamma(t)$ can be chosen arbitrarily on every slot, regardless of
whether or not the slot is an exploration event.
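The exploration mechanism can be sketched as follows. This is a minimal illustration, assuming a hypothetical helper `stage1_decision` and a caller-supplied `preferred` option standing in for whatever the controller would otherwise choose on a non-exploration slot.

```python
import random


def stage1_decision(K, theta, preferred, rng):
    """Return (k, explored) for one slot.

    With probability theta the slot is an exploration event and k is drawn
    uniformly from {1, ..., K}; otherwise the controller's preferred option is used.
    """
    if rng.random() < theta:
        return rng.randint(1, K), True  # randint is inclusive on both ends
    return preferred, False


# Empirical check that each exploration type occurs with frequency near theta / K.
rng = random.Random(0)
K, theta, slots = 4, 0.2, 100000
counts = [0] * K
for _ in range(slots):
    k, explored = stage1_decision(K, theta, preferred=1, rng=rng)
    if explored:
        counts[k - 1] += 1
print([c / slots for c in counts])  # each entry should be close to theta / K = 0.05
```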

If $\theta > 0$, the exploration events ensure that each stage-1 control option is tested infinitely often. Define $f_\theta^*$ as the
optimal solution of (4)-(6) subject to the additional constraint that such random exploration events are imposed. It
shall be convenient to define optimality in terms of $f_\theta^*$. It is clear that $f_0^* = f^*$, and intuitively one expects that
$f_\theta^* \to f^*$ as $\theta \to 0$.^6 Further, in systems where the optimal $f^*$ can be achieved by a policy that chooses each stage-1
control option a positive fraction of time, it can be shown that there exists a positive value $\theta^*$ such that $f^* = f_\theta^*$
whenever $0 \leq \theta \leq \theta^*$. We now assume the following properties hold concerning stationary and randomized control
policies with random exploration events of probability $\theta$.

^6 Specifically, it can be shown that $f_\theta^* \to f^*$ whenever $\epsilon_{max} > 0$, where $\epsilon_{max}$ is defined in Assumption 2.

Assumption 1 (Feasibility): There is a stationary and randomized policy that chooses a stage-1 control action
$k^*(t) \in \mathcal{K}$ according to a fixed probability distribution such that each option is chosen with probability at least $\theta/K$
(revealing a corresponding random vector $\omega^*(t)$), and chooses a stage-2 control action $I^*(t) \in \mathcal{I}$ as a potentially
randomized function of $\omega^*(t)$, such that:

$$l(\mathbb{E}\{x^*(t)\}) + \tilde{f}(\gamma^*) = f_\theta^* \qquad (17)$$
$$b_n - h_n(\mathbb{E}\{x^*(t)\}) \geq 0 \quad \text{for all } n \in \{1, \ldots, N\} \qquad (18)$$
$$\mathbb{E}\{\mu_l^*(t)\} - \mathbb{E}\{A_l^*(t)\} \geq 0 \quad \text{for all } l \in \{1, \ldots, L\} \qquad (19)$$

where $x^*(t)$, $\mu^*(t)$, $A^*(t)$ are the penalty, service rate, and arrival vectors corresponding to the stationary and
randomized policy, defined by:

$$x^*(t) = \hat{x}(k^*(t), \omega^*(t), I^*(t))$$
$$\mu^*(t) = \hat{\mu}(k^*(t), \omega^*(t), I^*(t))$$
$$A^*(t) = \hat{a}(k^*(t), \omega^*(t), I^*(t))$$

and where $\gamma^*$ is a vector with components $(\gamma_m^*)|_{m \in \tilde{\mathcal{M}}}$, with $\gamma_m^* = \mathbb{E}\{x_m^*(t)\}$ for all $m \in \tilde{\mathcal{M}}$. Note that
$x_m^{min} \leq x_m^*(t) \leq x_m^{max}$ always, and so $x_m^{min} \leq \gamma_m^* \leq x_m^{max}$ for all $m \in \tilde{\mathcal{M}}$. Thus, each component $\gamma_m^*$ satisfies
the required auxiliary variable constraint (7).

This assumption states that the problem is feasible, and that the optimal value can be achieved by a particular 
stationary and randomized policy that meets the time average penalty constraints and ensures the time average 
service rate is greater than or equal to the time average arrival rate in all queues.^7 The next assumption states that
the constraints are not only feasible, but have a useful slackness property. 

Assumption 2 (Slackness of Constraints): There is a value $\epsilon_{max} > 0$ together with a stationary and randomized
policy that makes stage-1 and stage-2 control decisions $k'(t) \in \mathcal{K}$ and $I'(t) \in \mathcal{I}$ such that each stage-1 option is
chosen with probability at least $\theta/K$, and:

$$b_n - h_n(\mathbb{E}\{x'(t)\}) \geq \epsilon_{max} \quad \text{for all } n \in \{1, \ldots, N\} \qquad (20)$$
$$\mathbb{E}\{\mu_l'(t)\} - \mathbb{E}\{A_l'(t)\} \geq \epsilon_{max} \quad \text{for all } l \in \{1, \ldots, L\} \qquad (21)$$

where $x'(t)$, $\mu'(t)$, $A'(t)$ are the penalty, service rate, and arrival vectors corresponding to the decisions $k'(t)$ and
$I'(t)$.

Now define $RHS(t, \Theta(t), k(t), I(t), \gamma(t))$ as the right hand side of the drift bound (16) with a given queue state
$\Theta(t)$ and control actions $k(t)$, $I(t)$, $\gamma(t)$ at time $t$. Given a particular queue state $\Theta(t)$, define the max-weight
control decisions $k^{mw}(t)$, $I^{mw}(t)$, $\gamma^{mw}(t)$ as the ones that minimize the following conditional expectation over all
alternative feasible control actions that can be made on slot $t$ (subject to the $\theta$ exploration probability):^8

$$\mathbb{E}\{RHS(t, \Theta(t), k(t), I(t), \gamma(t)) \mid \Theta(t)\} \qquad (22)$$

^7 See [4] for a proof that optimality can be defined over the class of stationary, randomized policies for minimum power problems.

^8 For simplicity, we implicitly assume that the infimum of (22) over all feasible control actions is achieved by a particular set of decisions,
called the max-weight decisions. Else, the results can be recovered by defining the max-weight decisions according to a sequence of policies
that converge to the infimum.
Note that the $k^{mw}(t)$ decisions are still determined randomly in the case of exploration events of probability $\theta$, but
are chosen to minimize the above expression whenever the current slot does not have an exploration event.

The auxiliary vector $\gamma(t)$ appears in separable terms on the right hand side of (16), and so the policy $\gamma^{mw}(t)$ can
be determined separately from the $k^{mw}(t)$ and $I^{mw}(t)$ decisions. It is computed by first observing the queue backlogs
$Z_m(t)$ on each slot $t$, and choosing $\gamma^{mw}(t)$ as the solution to the following deterministic convex optimization:

Minimize: $V \tilde{f}(\gamma(t)) - \sum_{m \in \tilde{\mathcal{M}}} Z_m(t) \gamma_m(t)$ (23)

Subject to: $x_m^{min} - \sigma \leq \gamma_m(t) \leq x_m^{max} + \sigma$ for all $m \in \tilde{\mathcal{M}}$ (24)

If the non-linear function $\tilde{f}(\gamma)$ is separable in the $\gamma$ vector (as is the case in many network optimization problems),
the above optimization amounts to separately finding $\gamma_m^{mw}(t)$ (for each $m \in \tilde{\mathcal{M}}$) as the minimum of a convex
single-variable function over the closed interval defined by (24).
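For the separable case, each single-variable problem can be solved numerically. The sketch below uses ternary search, which is one standard choice for minimizing a convex function on a closed interval and is not a method prescribed by the paper; the names `argmin_convex` and `gamma_mw` and the quadratic example cost are assumptions for illustration.

```python
def argmin_convex(g, lo, hi, iters=200):
    """Ternary search for a minimizer of a convex function g on the interval [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) <= g(m2):
            hi = m2  # by convexity a minimizer lies in [lo, m2]
        else:
            lo = m1  # by convexity a minimizer lies in [m1, hi]
    return 0.5 * (lo + hi)


def gamma_mw(f_m, V, z, x_min, x_max, sigma):
    """Solve the single-variable version of (23)-(24) for one separable component.

    Minimizes V * f_m(gamma) - z * gamma over [x_min - sigma, x_max + sigma],
    where f_m is the (convex) component of the non-linear cost and z is Z_m(t).
    """
    return argmin_convex(lambda g: V * f_m(g) - z * g, x_min - sigma, x_max + sigma)


# Hypothetical example: f_m(gamma) = gamma^2, V = 1, Z_m(t) = 2, interval [-0.5, 3.5].
# The unconstrained minimizer of gamma^2 - 2*gamma is gamma = 1.
print(round(gamma_mw(lambda g: g * g, 1.0, 2.0, 0.0, 3.0, 0.5), 6))  # 1.0
```

When $Z_m(t)$ is large and positive, the linear term dominates and the solution is pushed to the upper endpoint $x_m^{max} + \sigma$; when it is large and negative, the solution moves to the lower endpoint.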

While the $\gamma^{mw}(t)$ can thus be computed, it is more challenging to determine the stage-1 and stage-2 decisions
that minimize the right hand side of (16), as this would require knowledge of the probability distributions $F_k(\omega)$.
We thus seek an approximation to the $k^{mw}(t)$ and $I^{mw}(t)$ policies. Suppose the following additional assumption
holds concerning such an approximation.

Assumption 3 (Approximate Scheduling): Every slot $t$ the queue backlogs $\Theta(t)$ are observed and control decisions
$k(t) \in \mathcal{K}$ (subject to exploration events with probability $\theta$), $I(t) \in \mathcal{I}$, and $\gamma(t)$ satisfying (7) are made to ensure
the following:

$$\mathbb{E}\{RHS(t, \Theta(t), k(t), I(t), \gamma(t))\} \leq \mathbb{E}\{RHS(t, \Theta(t), k^{mw}(t), I^{mw}(t), \gamma^{mw}(t))\} + C + V \epsilon_V$$
$$+ \sum_{n=1}^{N} \mathbb{E}\{U_n(t)\} \epsilon_U + \sum_{m \in \tilde{\mathcal{M}}} \mathbb{E}\{|Z_m(t)|\} \epsilon_Z + \sum_{l=1}^{L} \mathbb{E}\{Q_l(t)\} \epsilon_Q \qquad (25)$$

where $C$, $\epsilon_V$, $\epsilon_U$, $\epsilon_Z$, $\epsilon_Q$ are non-negative constants (independent of $t$). The expectation on the left hand side is with
respect to the current queue state $\Theta(t)$ and the actual decisions $k(t)$, $I(t)$, $\gamma(t)$ implemented, while the expectation
on the right is with respect to the current queue state $\Theta(t)$ and the (possibly not implemented) max-weight decisions
$k^{mw}(t)$, $I^{mw}(t)$, $\gamma^{mw}(t)$ that minimize the right hand side of (16).

We note that the structure of the approximation bound in (25) is typical for algorithms that attempt to select
a control action based on imperfect knowledge of the probability distributions of the resulting $x(t)$, $\mu(t)$, $A(t)$
vectors, as the resulting approximations are typically proportional to the $V$ constant and the $U_n(t)$, $|Z_m(t)|$, and
$Q_l(t)$ queue sizes on the right hand side of (16). In the case of perfect implementation of the max-weight policy
$k^{mw}(t)$, $I^{mw}(t)$, $\gamma^{mw}(t)$, we have $\epsilon_V = \epsilon_U = \epsilon_Z = \epsilon_Q = 0$ and $C = 0$.

Theorem 1: (Performance Theorem) Suppose Assumptions 1 and 2 hold, and that a control algorithm is implemented
that satisfies Assumption 3 with fixed control parameters $V > 0$ and $\sigma > 0$. Suppose $\epsilon_Q$, $\epsilon_Z$, $\epsilon_U$ are small
enough and $\sigma$ is chosen large enough to satisfy the following:

(26) 
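Then all time average constraints (9)-(11) hold. In particular, all queues are strongly stable and satisfy for all $t$:

$$\frac{1}{t} \sum_{\tau=0}^{t-1} \left[ \sum_{n=1}^{N} \mathbb{E}\{U_n(\tau)\} + \sum_{m \in \tilde{\mathcal{M}}} \mathbb{E}\{|Z_m(\tau)|\} + \sum_{l=1}^{L} \mathbb{E}\{Q_l(\tau)\} \right] \leq \frac{B + C + V(l_{diff} + f_{diff} + \epsilon_V)}{\epsilon_{approx}} + \frac{\mathbb{E}\{L(\Theta(0))\}}{\epsilon_{approx} t} \qquad (27)$$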



where $\epsilon_{approx}$ is defined:

$$\epsilon_{approx} \triangleq \min[\epsilon_{max} - \epsilon_U, \; \epsilon_{max} - \epsilon_Q, \; \sigma - \epsilon_Z]$$



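and where $l_{diff}$ and $f_{diff}$ are finite bounds that satisfy:

$$l(x_1) - l(x_2) \leq l_{diff} \quad \text{for any } x_1, x_2 \text{ in the set } x^{min} \leq x \leq x^{max}$$
$$\tilde{f}(\gamma_1) - \tilde{f}(\gamma_2) \leq f_{diff} \quad \text{for any } \gamma_1, \gamma_2 \text{ in the set } x^{min} - \sigma \leq \gamma \leq x^{max} + \sigma$$

Further, the time average cost satisfies:^9

$$\limsup_{t \to \infty} f(\bar{x}(t)) \leq f_\theta^* + \epsilon_V + \delta + (B + C)/V \qquad (28)$$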

where we recall that $f(x) = l(x) + \tilde{f}(\tilde{x})$ and $f_\theta^*$ is the optimal solution of (8)-(11) subject to exploration events
with probability $\theta$, and where $\delta$ is defined:

S={ldiff + fdiff ) max , — , 

Theorem 1 states that, under the given approximation assumptions, the algorithm stabilizes all queues and yields
a time average cost that is within $\epsilon_V + \delta + O(1/V)$ of the optimal value $f_\theta^*$. Hence, this bound can be made
arbitrarily close to $f_\theta^* + \epsilon_V + \delta$ by choosing $V$ suitably large, at the cost of a linear increase in average queue
congestion with $V$. Further, we note that the terms $\epsilon_V$ and $\delta$ tend to zero as the error values $\epsilon_V$, $\epsilon_U$, $\epsilon_Z$, $\epsilon_Q$
tend to zero. In the special case when the exact max-weight policy is implemented every slot (so that every slot $t$
the controller makes decisions $k^{mw}(t)$, $I^{mw}(t)$, $\gamma^{mw}(t)$ that minimize the right hand side of (16)), then we have
$C = 0$ and $\epsilon_V = \epsilon_U = \epsilon_Z = \epsilon_Q = \delta = 0$. In this case, we can also choose $\theta = 0$ so that performance is within
$O(1/V)$ of the optimal value $f^*$. This special case is similar to the stochastic network optimization result of [1],
with the exception that [1] assumes the convex cost function $f(x)$ is non-decreasing in each entry of $x$ (using
auxiliary variables with "one-sided" virtual queues that are always non-negative), whereas here we treat a possibly
non-monotonic cost function via (possibly negative) virtual queues $Z_m(t)$.
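The tradeoff between cost and congestion can be made concrete with a small numerical sketch. It evaluates the long-run part of the backlog bound (27), namely $(B + C + V(l_{diff} + f_{diff} + \epsilon_V))/\epsilon_{approx}$, and the cost gap $\epsilon_V + \delta + (B + C)/V$ from (28). The function name `tradeoff` and all constant values below are placeholders chosen for illustration, not values from the paper.

```python
def tradeoff(V, B, C, l_diff, f_diff, eps_V, eps_approx, delta):
    """Return (backlog_bound, cost_gap) for a given V.

    backlog_bound : long-run term of the average backlog bound (grows linearly in V)
    cost_gap      : bound on the excess time average cost over the optimum (decays like 1/V)
    """
    backlog_bound = (B + C + V * (l_diff + f_diff + eps_V)) / eps_approx
    cost_gap = eps_V + delta + (B + C) / V
    return backlog_bound, cost_gap


# Hypothetical constants; exact max-weight case with eps_V = delta = C = 0.
for V in (1.0, 10.0, 100.0):
    backlog, gap = tradeoff(V, B=1.0, C=0.0, l_diff=1.0, f_diff=1.0,
                            eps_V=0.0, eps_approx=0.5, delta=0.0)
    print(V, backlog, gap)
```

Increasing $V$ by a factor of 10 shrinks the cost gap by a factor of 10 in this example while the backlog bound grows roughly tenfold, which is the $O(1/V)$ versus $O(V)$ tradeoff described above.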

Proof: (Theorem 1) See Appendix A. □ 
The following related theorem uses a variable $V(t)$ parameter and allows for the uncertainty to tend to zero while
achieving the exact optimal value $f_\theta^*$. Its proof follows as a simple consequence of the proof of Theorem 1.

^9 The expression (28) holds for all $t$ (without the lim sup) in the special case when $\Theta(0) = 0$ and $f(x)$ is linear so that $f(x) = l(x)$. The
rate at which the limit converges in the general (non-linear) case is proportional to the rate at which the time average expectations of $\gamma_m(t)$
converge to the time average expectations of $x_m(t)$ for each $m \in \tilde{\mathcal{M}}$, which is roughly the average of $|Z_m(t)|/t$. This is highlighted in the
proof of the theorem, see inequality (38).
Theorem 2: (Variable $V(t)$ parameter) Suppose Assumptions 1 and 2 hold. Let $\beta_1$ and $\beta_2$ be values such that
$0 < \beta_1 < \beta_2 < 1$. Assume that after some finite time $t_0$, we use a $V(t)$ parameter that increases with time, so
that $V(t) = (t - t_0 + 1)^{\beta_2} V_0$ for all $t \geq t_0$ and for some constant $V_0 > 0$. Assume the queue states at time $t_0$
are arbitrary but finite, and assume we make control decisions $k(t)$, $I(t)$, $\gamma(t)$ such that the following holds for all
$t \geq t_0$ (which is a modification of Assumption 3):

$$\mathbb{E}\{RHS(t, \Theta(t), k(t), I(t), \gamma(t))\} \leq \mathbb{E}\{RHS(t, \Theta(t), k^{mw}(t), I^{mw}(t), \gamma^{mw}(t))\} + C(t) + V(t) \epsilon_V(t)$$
$$+ \sum_{n=1}^{N} \mathbb{E}\{U_n(t)\} \epsilon_U(t) + \sum_{m \in \tilde{\mathcal{M}}} \mathbb{E}\{|Z_m(t)|\} \epsilon_Z(t) + \sum_{l=1}^{L} \mathbb{E}\{Q_l(t)\} \epsilon_Q(t)$$

where $C(t)$, $\epsilon_V(t)$, $\epsilon_U(t)$, $\epsilon_Z(t)$, $\epsilon_Q(t)$ are deterministic functions of time such that:

$$\lim_{t \to \infty} \epsilon_x(t) = 0$$

where $x \in \{V, U, Z, Q\}$, and where:

$$C(t) \leq O((t - t_0 + 1)^{\beta_1}) \quad \text{for } t \geq t_0$$

Then the time average constraints (9)-(10) hold, and all queues $Q_l(t)$ are mean rate stable, in the sense that:

$$\lim_{t \to \infty} \frac{\mathbb{E}\{Q_l(t)\}}{t} = 0 \quad \text{for all } l \in \{1, \ldots, L\}$$

Further, the time average cost converges to the optimal value $f_\theta^*$:

$$\lim_{t \to \infty} f(\bar{x}(t)) = f_\theta^*$$

Proof: See Appendix B. □
This method of using an increasing V(t) parameter can be viewed as a stochastic analogue of classic diminishing 
step-size methods for static optimization problems [16]. We note that C{t) is assumed to increase at a rate slower 
than that of V{t), while the ex{t) functions can converge to zero with any rate. Note that mean rate stability is a 
weak form of stability, and does not imply that average queue sizes and delays are finite. In fact, typically average 
congestion and delay are necessarily infinite when exact cost optimization is achieved [17] [18]. 
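A schedule of the form used in Theorem 2 is simple to write down. The sketch below assumes the form $V(t) = (t - t_0 + 1)^{\beta_2} V_0$ for $t \geq t_0$; the function name `v_schedule` and the parameter values are hypothetical.

```python
def v_schedule(t, t0, beta2, V0):
    """Increasing control parameter V(t) = (t - t0 + 1)**beta2 * V0, defined for t >= t0."""
    if t < t0:
        raise ValueError("the schedule is only defined for t >= t0")
    if not 0.0 < beta2 < 1.0:
        raise ValueError("beta2 is assumed to lie strictly between 0 and 1")
    return (t - t0 + 1) ** beta2 * V0


# Hypothetical parameters: V grows like the square root of elapsed time.
print([v_schedule(t, t0=0, beta2=0.5, V0=2.0) for t in (0, 3, 8)])  # [2.0, 4.0, 6.0]
```

Because the exponent is below 1, $V(t)$ grows sublinearly, so the weight on the penalty term increases without bound while still growing more slowly than $t$.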

Remark 1: The results of Theorems 1 and 2 can be generalized to allow the $h_n(x)$ functions to be convex
(possibly non-linear) by using one auxiliary variable $\gamma_m(t)$ for each penalty $x_m(t)$, in which case the constraints
(10) can be enforced by modifying the virtual queues $U_n(t)$ in (12) to have dynamics:

$$U_n(t+1) = \max[U_n(t) + h_n(\gamma(t)) - b_n, 0]$$

This has the disadvantage of creating more virtual queues (one for every penalty $x_m(t)$, rather than one only for each
penalty $m \in \mathcal{M}$), but has the advantage of allowing for non-linear $h_n(x)$ functions. It has the additional advantage
of removing the uncertain $x(t)$ penalties from the drift terms corresponding to the queues $U_n(t)$. This ensures
$\epsilon_U = 0$ whenever the auxiliary variables are chosen according to the max-weight rule $\gamma^{mw}(t)$ (which, due to
separability, does not require knowledge of the $F_k(\omega)$ distributions). Similarly, one can also use auxiliary variables
in the cost function $f(\gamma(t))$ (as a proxy for the $f(x(t))$ values), so that $\epsilon_V = 0$. With these modifications, all
uncertainty is isolated to $\epsilon_Z$ and $\epsilon_Q$.
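
The modified virtual queue recursion in Remark 1 can be sketched in a few lines of Python. This is a minimal sketch: the quadratic constraint function `h_quadratic`, the bound `b = 1.0`, and the sequence of auxiliary vectors are hypothetical choices made only to show a non-linear $h_n$ being applied to $\gamma(t)$.

```python
def update_virtual_queue(u, h_gamma, b):
    """One step of U(t+1) = max[U(t) + h(gamma(t)) - b, 0]."""
    return max(u + h_gamma - b, 0.0)


def h_quadratic(gamma):
    """A hypothetical convex (non-linear) constraint function of gamma."""
    return sum(g * g for g in gamma)


u = 0.0
history = []
for gamma in [(1.0, 1.0), (0.5, 0.5), (0.0, 0.0)]:
    u = update_virtual_queue(u, h_quadratic(gamma), b=1.0)
    history.append(u)
```

The queue grows while $h(\gamma(t))$ exceeds the bound and drains otherwise, never going below zero.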

Remark 2: Theorems 1 and 2 can be used for any form of approximate scheduling, including cases when
the optimal $I(t)$ decision involves a complex combinatorial choice that can only be approximated (or when the
optimization for the auxiliary variable $\gamma(t)$ is approximate). This is related to similar approximate scheduling
results developed for systems without stage-1 decisions in [1] [7] [19] [20]. However, our main interest is when the
approximation is due to the uncertainty in the probability distributions $F_k(\omega)$, and max-weight learning algorithms
for this context are developed in the next section.

III. Estimating the Max-Weight Functional

Theorem 1 suggests that our control policy should make decisions for $k(t)$, $I(t)$, $\gamma(t)$ every slot in an effort to
minimize the right hand side of (16). The optimal auxiliary variable decisions $\gamma^{mw}(t)$ for this goal have already
been established and are given by the solution of (23)-(24). Note that these decisions do not require knowledge of
the $F_k(\omega)$ distribution. Likewise, the optimal $I^{mw}(t)$ decision does not require knowledge of the $F_k(\omega)$ distribution.

Specifically, given a collection of observed queue backlogs $\Theta(t)$ and an observed outcome $\omega(t)$ (which is the result
of the stage-1 decision $k(t)$ that is chosen), $I^{mw}(t)$ is defined as the optimal solution to the following (breaking
ties arbitrarily):

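Minimize: $V l(x(k(t), \omega(t), I(t))) + \sum_{n=1}^{N} U_n(t) h_n(x(k(t), \omega(t), I(t))) + \sum_{m \in \mathcal{M}} Z_m(t) x_m(k(t), \omega(t), I(t)) - \sum_{l=1}^{L} Q_l(t)[\mu_l(k(t), \omega(t), I(t)) - a_l(k(t), \omega(t), I(t))]$ (29)
Subject to: $I(t) \in \mathcal{I}$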

The complexity of making these $I^{mw}(t)$ decisions depends on the physical structure of the network. The decisions
are often trivial when the set $\mathcal{I}$ contains only a finite (and small) number of control options (such as when the
decisions are to remain idle or serve a single queue), in which case the function (29) is simply compared on each
of the different choices in $\mathcal{I}$. For multi-hop networks with combinatorial resource allocation constraints, the choice
of $I^{mw}(t)$ might be difficult, although constant-factor approximations are often possible (see [1] [7] [19] [20]).
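
For a small finite option set, the comparison described above amounts to evaluating the expression once per option and keeping the minimizer. The following Python sketch shows this for a toy downlink. The helper names `stage2_decision` and `make_cost`, the single power penalty, and the absence of $U_n$ and $Z_m$ queues are all simplifying assumptions introduced for illustration.

```python
def stage2_decision(options, cost):
    """Return the option minimizing cost(I); ties go to the earliest option."""
    return min(options, key=cost)


def make_cost(V, Q, omega):
    """Toy instance of expression (29) for a downlink with one penalty (power).

    Assumptions (illustrative, not from the paper): there are no U_n or Z_m
    queues, arrivals do not depend on I, option None means "idle" (zero power,
    zero service), and option i means "serve queue i" with unit power and
    service rate omega[i].
    """
    def cost(I):
        if I is None:
            return 0.0
        return V * 1.0 - Q[I] * omega[I]
    return cost


options = [None, 0, 1]
busy = stage2_decision(options, make_cost(V=2.0, Q=[3.0, 5.0], omega=[1.0, 0.0]))
idle = stage2_decision(options, make_cost(V=2.0, Q=[1.0, 1.0], omega=[1.0, 1.0]))
```

In the first call the backlog-weighted service of queue 0 outweighs the power cost, so that queue is served; in the second call the backlogs are too small to justify transmitting and the idle option wins.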

The optimal $k^{mw}(t)$ decisions can be defined in terms of the $I^{mw}(t)$ decisions as follows: On each slot $t$, $k^{mw}(t)$
is chosen as $k$, according to an independent type-$k$ exploration event, with probability $\theta/K$. If no exploration event
occurs on slot $t$ (which happens with probability $1 - \theta$), the queue backlogs $\Theta(t)$ are observed and $k^{mw}(t)$ is chosen
as the value $k \in \{1, \ldots, K\}$ with the lowest value of $e_k(t)$ (breaking ties arbitrarily), where $e_k(t)$ is defined:

$$e_k(t) \triangleq \mathbb{E}\left\{ \min_{I \in \mathcal{I}} [Y_k(I, \omega(t), \Theta(t))] \mid k(t) = k, \Theta(t) \right\} \tag{30}$$

where $\omega(t)$ is the random outcome that results from the stage-1 choice $k(t) = k$, and the function $Y_k(I, \omega, \Theta)$ is
defined for a particular stage-2 decision $I$, outcome $\omega$, and queue state $\Theta = [Q; U; Z]$, as follows:

$$Y_k(I, \omega, \Theta) \triangleq V l(x(k, \omega, I)) + \sum_{n=1}^{N} U_n h_n(x(k, \omega, I)) + \sum_{m \in \mathcal{M}} Z_m x_m(k, \omega, I) - \sum_{l=1}^{L} Q_l[\mu_l(k, \omega, I) - a_l(k, \omega, I)] \tag{31}$$

Thus, $e_k(t)$ is the expected value of the expression (29) over the distribution $F_k(\omega)$ for the $\omega(t)$ random variable
that arises from choosing $k(t) = k$, assuming that the optimal stage-2 decision $I^{mw}(t)$ is then made. However,
computation of the exact $e_k(t)$ values would typically require full knowledge of the probability distributions $F_k(\omega)$
(and the computation may be difficult even if these distributions are fully known). Rather than using the exact
conditional expectations, we consider two forms of estimates.

A. Estimating the $e_k(t)$ value — Approach 1

Define an integer $W$ that represents a moving average window size. For each stage-1 option $k \in \{1, \ldots, K\}$ and
each time $t$, define $\omega_1^{(k)}(t), \ldots, \omega_W^{(k)}(t)$ as the actual $\omega(t)$ outcomes observed over the last $W$ type-$k$ exploration
events that took place before time $t$. Define the estimate $\hat{e}_k(t)$ as follows:

$$\hat{e}_k(t) \triangleq \frac{1}{W} \sum_{w=1}^{W} \min_{I \in \mathcal{I}} \left[ Y_k(I, \omega_w^{(k)}(t), \Theta(t)) \right]$$

In the case when there have not yet been $W$ previous type-$k$ exploration events by time $t$, the estimate $\hat{e}_k(t)$ is taken
with respect to the (fewer than $W$) events, and is set to zero if no such events have occurred. The estimates $\hat{e}_k(t)$ can
be viewed as empirical averages of the function (31), using the current queue backlogs $\Theta(t) = [Q(t); U(t); Z(t)]$
but using the outcomes $\omega_w^{(k)}(t)$ observed on previous type-$k$ exploration events and the corresponding optimal
stage-2 decisions.

Note that one might define $\hat{e}_k(t)$ according to an average over the past $W$ slots on which stage-1 decision $k$ has
been made, rather than over the past $W$ type-$k$ exploration events. The reason we have used exploration events
is to overcome the subtle "inspection paradox" issues involved in sampling the previous $\omega(\tau)$ outcomes. Indeed,
even though $\omega(t)$ is generated in an i.i.d. way every slot in which $k(t) = k$ is chosen, the distribution of the
last-seen outcome $\omega$ that corresponds to a particular decision $k$ may be skewed in favor of creating larger penalties.
This is because our algorithm may choose to avoid decision $k$ for a longer period of time if this last outcome
was non-favorable. Sampling at random type-$k$ exploration events ensures that our samples indeed form an i.i.d.
sequence. An additional difficulty remains: Even though these samples $\{\omega_w^{(k)}(t)\}$ form an i.i.d. sequence, they are
not independent of the queue values $\Theta(t)$, as these prior outcomes have influenced the current queue states. We
overcome this difficulty in Section III-D via a delayed-queue analysis.
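
The skew of the last-seen outcome can be reproduced in a small simulation. The sketch below is a hypothetical toy model, not the algorithm of this paper: a single option yields an i.i.d. penalty that is 0 or 1 with equal probability, and an assumed policy revisits the option with probability 0.1 after a bad outcome and 0.9 after a good one. The fresh draws stay unbiased, while the last-seen value observed at a typical slot is heavily skewed toward the bad outcome.

```python
import random


def simulate(num_slots=20000, seed=0):
    """Return (mean of fresh draws, time average of the last-seen outcome)."""
    rng = random.Random(seed)
    last_seen = rng.randint(0, 1)   # most recent penalty observed (1 = bad)
    draws = []                      # every fresh outcome, in order
    last_seen_trace = []            # last-seen outcome recorded at each slot
    for _ in range(num_slots):
        # Hypothetical policy: revisit the option rarely after a bad outcome.
        p_choose = 0.1 if last_seen == 1 else 0.9
        if rng.random() < p_choose:
            last_seen = 1 if rng.random() < 0.5 else 0   # i.i.d. Bernoulli(1/2)
            draws.append(last_seen)
        last_seen_trace.append(last_seen)
    return sum(draws) / len(draws), sum(last_seen_trace) / len(last_seen_trace)


draw_mean, last_seen_mean = simulate()
```

Under this toy policy the last-seen outcome follows a two-state Markov chain whose stationary probability of the bad state is 0.9, even though each fresh draw is bad only half of the time.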

This form of estimation does not require knowledge of the $F_k(\omega)$ distributions. However, evaluation of $\hat{e}_k(t)$
requires $W$ computations of the type (29) on each slot $t$, one for each term $\min_{I \in \mathcal{I}}[Y_k(I, \omega_w^{(k)}(t), \Theta(t))]$ of the average,
according to the value of each particular $\omega_w^{(k)}(t)$ vector.
This can be difficult in the case when $W$ is large, and hence the next subsection describes a second estimation
approach that uses only one such computation per slot.
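
A minimal Python sketch of the Approach 1 estimate is given below. It stores the latest $W$ outcomes in a bounded buffer and re-solves the stage-2 minimization for every stored outcome using the current backlog. The cost function `toy_cost` (idle versus serving one of two queues) and all numerical values are hypothetical stand-ins for $Y_k$.

```python
from collections import deque


def approach1_estimate(samples, backlog, options, cost):
    """Empirical average of min_I cost(I, omega, backlog) over stored outcomes.

    `samples` holds the omega outcomes from recent type-k exploration events,
    `backlog` is the current queue state, and `cost` plays the role of Y_k.
    Returns 0.0 when no samples exist, as in the convention stated above.
    """
    if not samples:
        return 0.0
    total = sum(min(cost(I, omega, backlog) for I in options) for omega in samples)
    return total / len(samples)


def toy_cost(I, omega, Q, V=1.0):
    """Hypothetical Y_k: idle costs 0; serving queue I costs V - Q[I]*omega[I]."""
    return 0.0 if I is None else V - Q[I] * omega[I]


window = deque(maxlen=2)                 # W = 2: only the latest two outcomes
for omega in [(1, 0), (0, 1), (1, 1)]:
    window.append(omega)
estimate = approach1_estimate(window, (4.0, 2.0), [None, 0, 1], toy_cost)
```

Each call performs one minimization per stored sample, which is the per-slot cost of $W$ computations noted in the text.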

B. Estimating the $e_k(t)$ value — Approach 2

Again let $W$ be an integer moving average window size. For each stage-1 decision $k \in \{1, \ldots, K\}$, define
$\omega_1^{(k)}(t), \ldots, \omega_W^{(k)}(t)$ the same as in Approach 1. Further define $\Theta_1^{(k)}(t), \ldots, \Theta_W^{(k)}(t)$ as the corresponding queue
backlogs at the latest $W$ type-$k$ exploration events before time $t$. Define an estimate $\hat{e}_k(t)$ as follows:

$$\hat{e}_k(t) \triangleq \frac{1}{W} \sum_{w=1}^{W} \min_{I \in \mathcal{I}} \left[ Y_k(I, \omega_w^{(k)}(t), \Theta_w^{(k)}(t)) \right]$$

The $\hat{e}_k(t)$ estimate is adjusted appropriately if fewer than $W$ type-$k$ exploration events have occurred (being set to
zero initially). This approach is different from Approach 1 in that the current queue backlogs are not used. Hence,
this is simply an empirical average over the past $W$ samples of the actual cost achieved in the $I^{mw}(\tau)$ computation
(29) at those particular sample times $\tau$. Because $I^{mw}(\tau)$ (and its corresponding cost) was already computed on
slot $\tau$ in order to make the stage-2 control decision, we can simply reuse the same value, without requiring any
additional computation of problems of type (29).
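
Because Approach 2 only reuses costs that were already computed, its bookkeeping reduces to one bounded buffer per stage-1 option. The class below is a minimal sketch under that reading; the class name `Approach2Estimator` and its methods are hypothetical.

```python
from collections import deque


class Approach2Estimator:
    """Keeps, for each stage-1 option, the last `window` realized costs."""

    def __init__(self, num_options, window):
        self.costs = [deque(maxlen=window) for _ in range(num_options)]

    def record(self, k, realized_cost):
        """Store the minimized value of (29) seen at a type-k exploration event."""
        self.costs[k].append(realized_cost)

    def estimate(self, k):
        """Average of the stored costs, or 0.0 if none have been recorded."""
        stored = self.costs[k]
        return sum(stored) / len(stored) if stored else 0.0


est = Approach2Estimator(num_options=2, window=2)
for value in (-3.0, -1.0, -5.0):
    est.record(0, value)
```

No stage-2 problem is re-solved when an estimate is requested, in contrast with Approach 1.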

C. The Max-Weight Learning Algorithm 

Let $\theta$ be a given exploration probability (so that $0 < \theta < 1$ and exploration events of type $k$ occur with
probability $\theta/K$). Let $\sigma > 0$ be a given parameter, and let $V(t)$ be a given (non-negative) control function of slot $t$
(possibly a constant function). Let $\tilde{W}(t)$ be a (possibly constant) function such that $\tilde{W}(t) \geq 1$ for all $t$, and define
$W_0 \triangleq \tilde{W}(0)$. Define the actual window size used at slot $t$ (for either Approach 1 or Approach 2) as follows:

$$W(t) \triangleq \min[\tilde{W}(t), W_{rand}(t)]$$

where $W_{rand}(t)$ is the minimum number of exploration events that have occurred for any type (minimized over the
types $k \in \{1, \ldots, K\}$), including the $W_0$ events that take place at initialization as described below. Thus, there are
always at least $W(t)$ type-$k$ exploration events by time $t$. The Max-Weight Learning Algorithm is as follows.

• (Initialization) For a given integer $W_0 > 0$, let $\Theta(-KW_0) = 0$, and run the system over slots $t \in
\{-W_0K, -W_0K + 1, \ldots, -1\}$, choosing each stage-1 decision option $k \in \{1, \ldots, K\}$ in a fixed round-
robin order (and choosing $I^{mw}(t)$ according to (29) and $\gamma^{mw}(t)$ according to (23)-(24)). This ensures that
we have $W_0$ independent samples from each stage-1 option by time 0, and creates a possibly non-zero initial
queue state $\Theta(0)$. Next perform the following sequence of actions for each slot $t \geq 0$.

• (Stage-1 Decisions) Independently with probability $\theta$, decide to have an exploration event. If there is an
exploration event, choose $k(t)$ uniformly over all options $\{1, \ldots, K\}$. If there is no exploration event, then
under Approach 1 we observe current queue backlogs $\Theta(t)$ and compute $\hat{e}_k(t)$ for each $k \in \{1, \ldots, K\}$
(using window size $W(t)$). We then choose $k(t)$ as the index $k \in \{1, \ldots, K\}$ that minimizes $\hat{e}_k(t)$ (breaking
ties arbitrarily). Under Approach 2, if there is no exploration event we choose $k(t)$ to minimize $\hat{e}_k(t)$ (using
window size $W(t)$).

• (Stage-2 Decisions) Observe the queue backlogs $\Theta(t)$ and the outcome $\omega(t)$ that resulted from the stage-
1 decision. Then choose $I^{mw}(t) \in \mathcal{I}$ according to (29). Choose auxiliary variables $\gamma^{mw}(t)$ according to
(23)-(24).

• (Past Value Storage) For Approach 1, store the resulting $\omega(t)$ vector in memory as appropriate. For Approach
2, store the resulting cost from (29) in memory as appropriate.

• (Queue Updates) Update virtual queues $U_n(t)$ according to (12) and $Z_m(t)$ according to (13). Also allow the
actual system queues $Q_l(t)$ to proceed according to (3).
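
The stage-1 step of the algorithm and the window rule $W(t) = \min[\tilde{W}(t), W_{rand}(t)]$ can be sketched as follows. This is a minimal sketch, not a full implementation: the function names are hypothetical, the estimates are passed in as a plain list, and ties are broken toward the lowest index as one arbitrary choice.

```python
import random


def window_size(w_tilde, explore_counts):
    """W(t) = min[W~(t), W_rand(t)], with W_rand(t) the smallest per-type count."""
    return min(w_tilde, min(explore_counts))


def stage1_decision(estimates, theta, rng):
    """Return (k, explored) for one slot.

    With probability theta an exploration event occurs and k is uniform over
    all options; otherwise k minimizes the current estimates (ties go to the
    lowest index, one arbitrary way of breaking them).
    """
    K = len(estimates)
    if rng.random() < theta:
        return rng.randrange(K), True
    return min(range(K), key=lambda k: estimates[k]), False


rng = random.Random(1)
greedy = stage1_decision([0.5, -1.0, 0.2], theta=0.0, rng=rng)
forced = stage1_decision([0.5, -1.0, 0.2], theta=1.0, rng=rng)
```

Only slots flagged as `explored` would contribute new samples to the stored estimates, which is what keeps the stored samples i.i.d.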

Remark 3: For some systems, we may not require an exploration event for each of the $K$ stage-1 decision options.
For example, in an $L$-queue downlink where the decisions are to either measure all channels, blindly transmit over
one of the $L$ channels, or remain idle (as in [5]), there are $K = L + 2$ stage-1 options. However, the "idle" choice
does not require any exploration events, as it clearly incurs a cost of 0. Further, the information gained by randomly 
choosing to blindly transmit over a given channel can also be gained by measuring all channels, as the outcome 
of the channel measurement can be used to determine if a blind transmission would have been successful. It is 
therefore more efficient to modify the algorithm by considering only one type of exploration event: the one that 
randomly chooses to measure all channels. Similarly, in DIVBAR-like situations where the K decisions involve 
sending a packet of one of the various commodities (as in [7]), the success/failure event observed after sending 
any particular packet does not depend on the packet commodity and hence can be used to update the max-weight 
estimates for each commodity. 

D. Analysis of the Max-Weight Learning Algorithm 

For brevity, we analyze only Approach 2.^{10} Let $k^{mw}(t)$ denote the (ideal) max-weight stage-1 decision on slot
$t$, and let $k(t)$ denote the Approach 2 decision. Recall that Approach 2 also uses the (ideal) $I^{mw}(t)$ and $\gamma^{mw}(t)$
decisions. Our goal is to compute parameters $C$, $\epsilon_V$, $\epsilon_U$, $\epsilon_Z$, $\epsilon_Q$ for (25) that can be plugged into Theorem 1.

Theorem 3: (Performance Under Approach 2 — Fixed Window) Suppose the Max-Weight Learning Algorithm
with Approach 2 is implemented using an exploration probability $\theta > 0$. Suppose we use a fixed integer window
size $W = W_0 > 0$ (so that $W(t) = W$ for all $t$, and our initialization takes $W$ samples from each exploration
type before time 0). Suppose that $V(t)$ is held constant, so that $V(t) = V$ for some $V > 0$. Then condition (25)
of Assumption 3 holds with:

$$C = c[WK^2/\theta + WK^2], \qquad \epsilon_V = \epsilon_U = \epsilon_Z = \epsilon_Q = \frac{K y_{diff}^{max}}{2\sqrt{W}}$$

where $c$ and $y_{diff}^{max}$ are constants that are independent of queue backlog and of $V$ and $W$ (and depend on the maximum
and minimum penalties and maximum queue changes that can occur on one slot).

^{10}Bounds on the performance of Approach 1 can be obtained similarly. In practice, Approach 1 would typically have superior performance
because it uses current queue backlogs.

Proof: See Appendix C. □

It follows that if the fixed window size $W$ is chosen to be suitably large, then the $\epsilon_V$, $\epsilon_U$, $\epsilon_Z$, $\epsilon_Q$ constants will
be small enough to satisfy the conditions $\epsilon_U < \epsilon_{max}$, $\epsilon_Z < \sigma$, $\epsilon_Q < \epsilon_{max}$ required for Theorem 1, and hence the
result of Theorem 1 holds for this max-weight learning algorithm.
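
To make the role of the window size concrete, the sketch below inverts the expression $\epsilon = K y_{diff}^{max} / (2\sqrt{W})$ from Theorem 3 to find the smallest window meeting a target value, and evaluates the corresponding constant $C = c[WK^2/\theta + WK^2]$. The function names and the numerical inputs (including the value used for the constant $c$) are hypothetical.

```python
import math


def epsilon(K, y_diff_max, W):
    """The common value of the epsilon constants: K * y / (2 * sqrt(W))."""
    return K * y_diff_max / (2.0 * math.sqrt(W))


def min_window(K, y_diff_max, target):
    """Smallest integer W for which epsilon(K, y_diff_max, W) <= target."""
    return math.ceil((K * y_diff_max / (2.0 * target)) ** 2)


def c_constant(c, W, K, theta):
    """C = c * [W*K^2/theta + W*K^2]; the constant c here is hypothetical."""
    return c * (W * K ** 2 / theta + W * K ** 2)


W_needed = min_window(K=4, y_diff_max=1.0, target=0.25)
C_value = c_constant(c=1.0, W=W_needed, K=4, theta=0.5)
```

Halving the target quadruples the required window, and $C$ grows linearly with $W$, so tighter estimates are paid for through the constant $C$.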

Theorem 4: (Performance Under Approach 2 — Variable $W(t)$ and $V(t)$) Suppose that we use the Max-Weight
Learning algorithm (with Approach 2) using an exploration probability $\theta > 0$ and a variable $V(t)$ and $W(t)$ with
initialization parameter $W_0 = 1$, and with:

$$V(t) = (t + 1)^{\beta_2} V_0, \qquad W(t) = \min[(t + 1)^{\beta_1}, W_{rand}(t)]$$

where $\beta_1$ and $\beta_2$ are constants such that $0 < \beta_1 < \beta_2 < 1$, $V_0$ is a positive constant, and where we recall
that $W_{rand}(t)$ is the minimum number of exploration events of type $k$ that have occurred, minimized over all $k \in
\{1, \ldots, K\}$. Then the time average constraints (9)-(10) hold, all queues $Q_l(t)$ are mean rate stable, and the time
average cost converges to the optimal value $f_\theta^*$:

$$\lim_{t \to \infty} f(\overline{x}(t)) = f_\theta^*$$

Proof: The proof combines results from the proofs of Theorems 3 and 2, and is given in Appendix E. □

IV. Conclusion 

This work extends the important max-weight framework for stochastic network optimization to a context with 
2-stage decisions and unknown distributions that govern the stochastics at the first stage. This is useful in a variety 
of contexts, including transmission scheduling in wireless networks in unknown environments and with unknown 
channels. The learning algorithms developed here are based on estimates of expected max-weight functionals, and 
are much more efficient than algorithms that would attempt to learn the complete probability distributions associated 
with the system. Our analysis provides explicit bounds on the deviation from optimality in terms of the sample size
W and the control parameter V. The W and V parameters also affect an explicit tradeoff in average congestion 
and delay. A modified algorithm with time-varying W{t) and V{t) parameters was shown to converge to exact 
optimal performance while keeping all queues mean-rate stable, at the cost of incurring a possibly infinite average 
congestion and delay. 

Appendix A — Proof of Theorem 1 

Proof: (Theorem 1 — The Queue Stability Inequality (27)) Writing the drift inequality (16) using the $RHS(\cdot)$
function yields:

$$\mathbb{E}\{V l(x(t)) + V f(\gamma(t)) \mid \Theta(t)\} + \Delta(\Theta(t)) \leq \mathbb{E}\{RHS(t, \Theta(t), k(t), I(t), \gamma(t)) \mid \Theta(t)\}$$
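Taking expectations of both sides with respect to the queue state distribution for $\Theta(t)$ and using the law of iterated
expectations yields:

$$V\mathbb{E}\{l(x(t)) + f(\gamma(t))\} + \mathbb{E}\{L(\Theta(t+1))\} - \mathbb{E}\{L(\Theta(t))\} \leq \mathbb{E}\{RHS(t, \Theta(t), k(t), I(t), \gamma(t))\}$$

$$\leq \mathbb{E}\{RHS(t, \Theta(t), k^{mw}(t), I^{mw}(t), \gamma^{mw}(t))\} + C + V\epsilon_V + \epsilon_Z \sum_{m \in \mathcal{M}} \mathbb{E}\{|Z_m(t)|\} + \epsilon_U \sum_{n=1}^{N} \mathbb{E}\{U_n(t)\} + \epsilon_Q \sum_{l=1}^{L} \mathbb{E}\{Q_l(t)\} \tag{32}$$

$$\leq \mathbb{E}\{RHS(t, \Theta(t), k'(t), I'(t), \gamma'(t))\} + C + V\epsilon_V + \epsilon_Z \sum_{m \in \mathcal{M}} \mathbb{E}\{|Z_m(t)|\} + \epsilon_U \sum_{n=1}^{N} \mathbb{E}\{U_n(t)\} + \epsilon_Q \sum_{l=1}^{L} \mathbb{E}\{Q_l(t)\} \tag{33}$$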

where (32) holds by Assumption 3, and (33) holds because the max-weight policy minimizes the expectation of
$RHS(\cdot)$ over all alternative decisions for slot $t$. The decisions $k'(t)$, $I'(t)$, $\gamma'(t)$ can be chosen as any feasible
control decisions for slot $t$ (where a feasible control decision for $k'(t)$ must also respect the random exploration
events of probability $\theta$). Suppose that $k'(t)$ and $I'(t)$ are the decisions given in Assumption 2, so that properties
(20) and (21) hold. Choose auxiliary decision variables $\gamma'(t) = (\gamma'_m(t))_{m \in \mathcal{M}}$ as follows:

$$\gamma'_m(t) = \begin{cases} \mathbb{E}\{x'_m(t)\} + \sigma & \text{if } Z_m(t) \geq 0 \\ \mathbb{E}\{x'_m(t)\} - \sigma & \text{if } Z_m(t) < 0 \end{cases} \tag{34}$$

Note that these $\gamma'_m(t)$ decisions satisfy the required constraints (7). That is because for each $m \in \mathcal{M}$ we have
$x_m^{min} \leq \mathbb{E}\{x'_m(t)\} \leq x_m^{max}$ and therefore:

$$x_m^{min} - \sigma \leq \mathbb{E}\{x'_m(t)\} - \sigma \leq \mathbb{E}\{x'_m(t)\} + \sigma \leq x_m^{max} + \sigma$$
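Using these $\gamma'_m(t)$ decisions and the definition of $RHS(\cdot)$ in the inequality (33) yields:^{11}

$$V\mathbb{E}\{l(x(t)) + f(\gamma(t))\} + \mathbb{E}\{L(\Theta(t+1))\} - \mathbb{E}\{L(\Theta(t))\} \leq B + C + V\epsilon_V + \mathbb{E}\{V l(x'(t)) + V f(\gamma'(t))\}$$
$$\qquad - \sum_{m \in \mathcal{M}} \mathbb{E}\{Z_m(t)[\mathbb{E}\{x'_m(t)\} - x'_m(t)]\} - \sum_{m \in \mathcal{M}} \mathbb{E}\{|Z_m(t)|[\sigma - \epsilon_Z]\}$$
$$\qquad - \sum_{n=1}^{N} \mathbb{E}\{U_n(t)[b_n - h_n(x'(t)) - \epsilon_U]\} - \sum_{l=1}^{L} \mathbb{E}\{Q_l(t)[\mu'_l(t) - A'_l(t) - \epsilon_Q]\} \tag{35}$$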

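Note that because the policies $k'(t)$ and $I'(t)$ are stationary, randomized, and independent of the queue backlog
vector $\Theta(t)$, and because the functions $h_n(x)$ are linear or affine, we have:

$$\mathbb{E}\{U_n(t) h_n(x'(t))\} = \mathbb{E}\{U_n(t)\} h_n(\mathbb{E}\{x'(t)\})$$
$$\mathbb{E}\{Z_m(t) x'_m(t)\} = \mathbb{E}\{Z_m(t)\} \mathbb{E}\{x'_m(t)\}$$
$$\mathbb{E}\{Q_l(t)[\mu'_l(t) - A'_l(t)]\} = \mathbb{E}\{Q_l(t)\} \mathbb{E}\{\mu'_l(t) - A'_l(t)\}$$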

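Using these identities together with properties (20)-(21) directly in the right hand side of (35) and rearranging terms
yields:

$$\mathbb{E}\{L(\Theta(t+1))\} - \mathbb{E}\{L(\Theta(t))\} \leq B + C + V[l_{diff} + f_{diff} + \epsilon_V]$$
$$\qquad - (\epsilon_{max} - \epsilon_U) \sum_{n=1}^{N} \mathbb{E}\{U_n(t)\} - (\sigma - \epsilon_Z) \sum_{m \in \mathcal{M}} \mathbb{E}\{|Z_m(t)|\} - (\epsilon_{max} - \epsilon_Q) \sum_{l=1}^{L} \mathbb{E}\{Q_l(t)\} \tag{36}$$

where we have used the following fact:

$$\mathbb{E}\{l(x'(t)) - l(x(t))\} \leq l_{diff}, \qquad \mathbb{E}\{f(\gamma'(t)) - f(\gamma(t))\} \leq f_{diff}$$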

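The inequality (36) holds for all slots $t \in \{0, 1, 2, \ldots\}$. Summing the telescoping series over $\tau \in \{0, 1, \ldots, t-1\}$
(as in [1]) and dividing by $t$ yields:

$$\frac{\mathbb{E}\{L(\Theta(t))\} - \mathbb{E}\{L(\Theta(0))\}}{t} + \frac{1}{t} \sum_{\tau=0}^{t-1} \left[ (\epsilon_{max} - \epsilon_U) \sum_{n=1}^{N} \mathbb{E}\{U_n(\tau)\} + (\sigma - \epsilon_Z) \sum_{m \in \mathcal{M}} \mathbb{E}\{|Z_m(\tau)|\} + (\epsilon_{max} - \epsilon_Q) \sum_{l=1}^{L} \mathbb{E}\{Q_l(\tau)\} \right]$$
$$\qquad \leq B + C + V[l_{diff} + f_{diff} + \epsilon_V]$$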



^{11}Recall that $RHS(\cdot)$ is defined as the right hand side of (16).

Using non-negativity of the Lyapunov function $L(\cdot)$ in the above inequality proves (27). Taking the limsup of
(27) as $t \to \infty$ proves that the queues $Q_l(t)$, $Z_m(t)$, $U_n(t)$ are strongly stable (for all $l \in \{1, \ldots, L\}$, $m \in \mathcal{M}$,
$n \in \{1, \ldots, N\}$). Hence (by Lemma 1), the inequality constraints (9)-(11) are satisfied. □
Proof: (Theorem 1 — The Utility Inequality (28)) Recall that the inequality (33) holds for any alternative set
of feasible control decisions $k''(t)$, $I''(t)$, $\gamma''(t)$. Re-writing (33) using this notation and using the definition of
$RHS(\cdot)$ yields:

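$$V\mathbb{E}\{l(x(t)) + f(\gamma(t))\} + \mathbb{E}\{L(\Theta(t+1))\} - \mathbb{E}\{L(\Theta(t))\} \leq B + C + V\epsilon_V + \epsilon_Z \sum_{m \in \mathcal{M}} \mathbb{E}\{|Z_m(t)|\}$$
$$\qquad + \mathbb{E}\{V l(x''(t)) + V f(\gamma''(t))\} - \sum_{n=1}^{N} \mathbb{E}\{U_n(t)(b_n - h_n(x''(t)) - \epsilon_U)\}$$
$$\qquad - \sum_{m \in \mathcal{M}} \mathbb{E}\{Z_m(t)(\gamma''_m(t) - x''_m(t))\} - \sum_{l=1}^{L} \mathbb{E}\{Q_l(t)(\mu''_l(t) - A''_l(t) - \epsilon_Q)\}$$

Let $\alpha$ be a probability (to be chosen later), and define joint control actions $(k''(t); I''(t); \gamma''(t))$ as follows:

$$(k''(t); I''(t); \gamma''(t)) = \begin{cases} (k'(t); I'(t); \gamma'(t)) & \text{with prob. } \alpha \\ (k^*(t); I^*(t); \gamma^*) & \text{with prob. } 1 - \alpha \end{cases}$$

where $k'(t)$, $I'(t)$ are as defined in Assumption 2 (and satisfy (20)-(21)), variables $\gamma'_m(t)$ are as defined in (34),
and $I^*(t)$, $k^*(t)$, $\gamma^*$ are as defined in Assumption 1 (and satisfy properties (17)-(19)). Note that the $k''(t)$ decision
defined here still has random exploration events with probability $\theta$, as both $k'(t)$ and $k^*(t)$ have such events. Also
note that $\gamma''_m(t)$ satisfies (7) because both $\gamma'_m(t)$ and $\gamma^*_m$ satisfy (7). Further, we have:

$$\mathbb{E}\{x''(t)\} = \alpha\mathbb{E}\{x'(t)\} + (1 - \alpha)\mathbb{E}\{x^*(t)\}$$
$$\mathbb{E}\{\gamma''(t)\} = \alpha\mathbb{E}\{\gamma'(t)\} + (1 - \alpha)\gamma^*$$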
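It follows from properties (20)-(21) and (17)-(19) (together with linearity of $l(x)$ and $h_n(x)$ and the fact that
the randomized $k''(t)$ and $I''(t)$ choices are independent of queue backlog) that:

$$V\mathbb{E}\{l(x(t)) + f(\gamma(t))\} + \mathbb{E}\{L(\Theta(t+1))\} - \mathbb{E}\{L(\Theta(t))\} \leq B + C + V\epsilon_V$$
$$\qquad + (1 - \alpha)V l(\mathbb{E}\{x^*(t)\}) + (1 - \alpha)V f(\gamma^*) + \alpha V l(\mathbb{E}\{x'(t)\}) + \alpha V \mathbb{E}\{f(\gamma'(t))\}$$
$$\qquad - \sum_{n=1}^{N} \mathbb{E}\{U_n(t)\}(\alpha\epsilon_{max} - \epsilon_U) - \sum_{m \in \mathcal{M}} \mathbb{E}\{|Z_m(t)|\}(\alpha\sigma - \epsilon_Z) - \sum_{l=1}^{L} \mathbb{E}\{Q_l(t)\}(\alpha\epsilon_{max} - \epsilon_Q)$$

Now choose $\alpha$ as follows:

$$\alpha = \max\left[ \frac{\epsilon_U}{\epsilon_{max}}, \frac{\epsilon_Z}{\sigma}, \frac{\epsilon_Q}{\epsilon_{max}} \right]$$

This is a valid probability because we have assumed that $\epsilon_U < \epsilon_{max}$, $\epsilon_Z < \sigma$, $\epsilon_Q < \epsilon_{max}$. The above inequality
reduces to:

$$V\mathbb{E}\{l(x(t)) + f(\gamma(t))\} + \mathbb{E}\{L(\Theta(t+1))\} - \mathbb{E}\{L(\Theta(t))\} \leq B + C + V\epsilon_V + V f_\theta^* + \alpha V(l_{diff} + f_{diff}) \tag{37}$$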
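The above inequality holds for all $t$. Taking a telescoping series over $\tau \in \{0, 1, \ldots, t-1\}$ yields:

$$V \sum_{\tau=0}^{t-1} \mathbb{E}\{l(x(\tau)) + f(\gamma(\tau))\} + \mathbb{E}\{L(\Theta(t))\} - \mathbb{E}\{L(\Theta(0))\} \leq t[B + C + V\epsilon_V + V f_\theta^* + \alpha V(l_{diff} + f_{diff})]$$

Therefore, using $\delta \triangleq \alpha(l_{diff} + f_{diff})$, non-negativity of $L(\cdot)$, and Jensen's inequality with convexity of $l(x)$ and
$f(x)$, we have:

$$l(\overline{x}(t)) + f(\overline{\gamma}(t)) \leq f_\theta^* + \epsilon_V + \delta + \frac{B + C}{V} + \frac{\mathbb{E}\{L(\Theta(0))\}}{Vt}$$

However, we have:

$$f(\overline{\gamma}(t)) \geq f(\overline{x}(t)) - M\nu\|\overline{x}(t) - \overline{\gamma}(t)\|$$

where $\nu$ is the magnitude of the largest left or right partial derivative of the $f(\cdot)$ function and $M$ is the cardinality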
of M.}"^ Combining the above two inequalities and using the fact that f{x)=l{x) + f{x) yields: 

$$f(\overline{x}(t)) - M\nu\|\overline{x}(t) - \overline{\gamma}(t)\| \leq f_\theta^* + \epsilon_V + \delta + \frac{B + C}{V} + \frac{\mathbb{E}\{L(\Theta(0))\}}{Vt} \tag{38}$$

Because the equality constraints (10) hold, we have that $\|\overline{x}(t) - \overline{\gamma}(t)\| \to 0$. Taking the limsup of (38) as $t \to \infty$
thus yields (28), completing the proof. □

^{12}Left and right partial derivatives exist and are finite for any convex function that is defined over the full space $\mathbb{R}^M$.

Note that in the special case when there are no auxiliary variables (so that $f(x)$ is linear and $f(x) = l(x)$), and
when all queues are initially empty, the inequality (38) reduces to the following cost guarantee that holds for all
time $t$:

$$f(\overline{x}(t)) \leq f_\theta^* + \epsilon_V + \delta + (B + C)/V$$

Appendix B — Proof of the Variable $V(t)$ Theorem (Theorem 2)

Proof: (Mean Rate Stability of all Queues) Assume without loss of generality that $\epsilon_U(t) < \epsilon_{max}$, $\epsilon_Z(t) < \sigma$,
$\epsilon_Q(t) < \epsilon_{max}$ for all $t \geq t_0$ (else, choose a later time $t_0$ for which this holds). Then, on a single slot $t$, we can apply the
result from the proof of Theorem 1 with $V = V(t)$ and $\epsilon_x = \epsilon_x(t)$ (for $x \in \{V, U, Z, Q\}$). Thus, for any time $t \geq t_0$
we have from (36):

$$\mathbb{E}\{L(\Theta(t+1))\} - \mathbb{E}\{L(\Theta(t))\} \leq B + C(t) + V(t)[l_{diff} + f_{diff} + \epsilon_V(t)]$$

where we have neglected the three non-positive terms on the right hand side of (36). Summing the above inequality
over $\tau \in \{t_0, \ldots, t-1\}$ and dividing by $t - t_0$ yields:

$$\frac{\mathbb{E}\{L(\Theta(t))\} - \mathbb{E}\{L(\Theta(t_0))\}}{t - t_0} \leq B + \frac{O(t^{\beta_2 + 1})}{t - t_0}$$

where we have used the fact that $\sum_{\tau=t_0}^{t-1} C(\tau) \leq O(t^{\beta_1 + 1})$ and $\sum_{\tau=t_0}^{t-1} V(\tau) \leq O(t^{\beta_2 + 1})$. Because $L(\Theta(t))$ is a
sum of squared queue lengths (for all queues), the above inequality implies that for any queue $Q_l(t)$:

$$\frac{\mathbb{E}\{Q_l(t)^2\}}{t - t_0} \leq B + \frac{O(t^{\beta_2 + 1})}{t - t_0} + \frac{\mathbb{E}\{L(\Theta(t_0))\}}{t - t_0}$$

Dividing the above inequality by $t - t_0$, taking square roots, and using the fact that $\mathbb{E}\{Q_l(t)^2\} \geq \mathbb{E}\{Q_l(t)\}^2$
yields:

$$\frac{\mathbb{E}\{Q_l(t)\}}{t - t_0} \leq \sqrt{\frac{B}{t - t_0} + \frac{O(t^{\beta_2 + 1})}{(t - t_0)^2} + \frac{\mathbb{E}\{L(\Theta(t_0))\}}{(t - t_0)^2}}$$

Because $\beta_2 + 1 < 2$, the right hand side above converges to 0 as $t \to \infty$. This holds for all queues $Q_l(t)$, and
hence all these queues are mean rate stable. Similarly, it holds for all queues $Z_m(t)$ and $U_n(t)$ (for $m \in \mathcal{M}$ and
$n \in \{1, \ldots, N\}$), and so all these queues are mean rate stable. It follows by Lemma 1 that all inequality constraints
(9)-(10) are satisfied. □

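Proof: (Cost Optimality) Again assume (without loss of generality) that $\epsilon_U(t) < \epsilon_{max}$, $\epsilon_Z(t) < \sigma$, $\epsilon_Q(t) < \epsilon_{max}$
for all $t \geq t_0$. We thus have from (37), after dividing by $V(t)$, that:

$$\mathbb{E}\{l(x(t)) + f(\gamma(t))\} + \frac{\mathbb{E}\{L(\Theta(t+1))\} - \mathbb{E}\{L(\Theta(t))\}}{V(t)} \leq f_\theta^* + \frac{B + C(t)}{V(t)} + \epsilon_V(t) + \alpha(t)(l_{diff} + f_{diff})$$

where $\alpha(t)$ is defined:

$$\alpha(t) \triangleq \max\left[ \frac{\epsilon_U(t)}{\epsilon_{max}}, \frac{\epsilon_Z(t)}{\sigma}, \frac{\epsilon_Q(t)}{\epsilon_{max}} \right]$$

and satisfies $\alpha(t) \to 0$ as $t \to \infty$. The above holds for all $t \geq t_0$. Summing over $\tau \in \{t_0, \ldots, t-1\}$ yields:

$$\sum_{\tau=t_0}^{t-1} \mathbb{E}\{l(x(\tau)) + f(\gamma(\tau))\} + \sum_{\tau=t_0+1}^{t-1} \mathbb{E}\{L(\Theta(\tau))\}\left[ \frac{1}{V(\tau - 1)} - \frac{1}{V(\tau)} \right] + \frac{\mathbb{E}\{L(\Theta(t))\}}{V(t-1)} - \frac{\mathbb{E}\{L(\Theta(t_0))\}}{V(t_0)}$$
$$\qquad \leq \sum_{\tau=t_0}^{t-1} \left[ f_\theta^* + \frac{B + C(\tau)}{V(\tau)} + \epsilon_V(\tau) + \alpha(\tau)(l_{diff} + f_{diff}) \right]$$

Using non-negativity of $L(\cdot)$ and the fact that $\frac{1}{V(\tau - 1)} - \frac{1}{V(\tau)} \geq 0$ (because $V(t)$ is non-decreasing), and dividing
by $(t - t_0)$ yields:

$$\frac{1}{t - t_0} \sum_{\tau=t_0}^{t-1} \mathbb{E}\{l(x(\tau)) + f(\gamma(\tau))\} \leq f_\theta^* + \frac{\mathbb{E}\{L(\Theta(t_0))\}}{V(t_0)(t - t_0)} + \Psi(t) \tag{39}$$

where $\Psi(t)$ is defined:

$$\Psi(t) \triangleq \frac{1}{t - t_0} \sum_{\tau=t_0}^{t-1} \left[ \frac{B + C(\tau)}{V(\tau)} + \epsilon_V(\tau) + \alpha(\tau)(l_{diff} + f_{diff}) \right]$$

Note that $C(t)/V(t) \to 0$ as $t \to \infty$, and hence $\Psi(t)$ is the time average of a function that converges to 0. We
thus have $\Psi(t) \to 0$ as $t \to \infty$. By Jensen's inequality applied to the left hand side of (39) we have:

$$l(\overline{x}(t)) + f(\overline{\gamma}(t)) \leq f_\theta^* + \frac{\mathbb{E}\{L(\Theta(t_0))\}}{V(t_0)(t - t_0)} + \Psi(t)$$

where $\overline{x}(t)$ and $\overline{\gamma}(t)$ are time average expectations over the interval $\tau \in \{t_0, \ldots, t-1\}$. Because we already know
$Z_m(t)$ is mean rate stable for all $m \in \mathcal{M}$, we have that $\|\overline{\gamma}(t) - \overline{x}(t)\| \to 0$ as $t \to \infty$ (by Lemma 1), and hence,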
as in the proof of Theorem 1 (using f{x) = l{x) + f{x)): 

$$\limsup_{t \to \infty} f(\overline{x}(t)) \leq f_\theta^*$$

Because $f_\theta^*$ is defined as the infimum cost subject to queue stability,^{13} it can be shown that the liminf cannot be
lower than $f_\theta^*$, and so the limit of $f(\overline{x}(t))$ exists and is equal to the limsup, proving the result. □

Appendix C — Proof of Theorem 3 
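To prove Theorem 3, fix time $t$ and define $\Omega(\Theta(t))$ as follows:

$$\Omega(\Theta(t)) \triangleq \mathbb{E}\{RHS(t, \Theta(t), k(t), I^{mw}(t), \gamma^{mw}(t)) \mid \Theta(t)\} - \mathbb{E}\{RHS(t, \Theta(t), k^{mw}(t), I^{mw}(t), \gamma^{mw}(t)) \mid \Theta(t)\}$$

Now note that because these right-hand sides differ only in terms comprising the $e_k(t)$ expression, we have:

$$\Omega(\Theta(t)) = \mathbb{E}\{e_{k(t)}(t) \mid \Theta(t)\} - \min_{k \in \mathcal{K}}[e_k(t)]$$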

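where the expectation on the right hand side is over the random decision $k(t) = \arg\min_{k \in \mathcal{K}}[\hat{e}_k(t)]$, which is
based on the empirical average $\hat{e}_k(t)$ formed by the past $W$ random samples. It uses the fact that given a particular
(possibly sub-optimal) decision $k(t) \in \mathcal{K}$, the resulting expected max-weight functional (using the exact but unknown
distribution function $F_{k(t)}(\omega)$) is $e_{k(t)}(t)$. Now for each $k \in \mathcal{K}$, define $\delta_k(t) \triangleq \hat{e}_k(t) - e_k(t)$. We thus have:

$$\mathbb{E}\{e_{k(t)}(t) \mid \Theta(t)\} = \mathbb{E}\{\hat{e}_{k(t)}(t) - \delta_{k(t)}(t) \mid \Theta(t)\}$$
$$\leq \mathbb{E}\{\hat{e}_{k(t)}(t) \mid \Theta(t)\} + \mathbb{E}\{\max_{k \in \mathcal{K}}[-\delta_k(t)] \mid \Theta(t)\}$$
$$= \mathbb{E}\{\min_{k \in \mathcal{K}}[\hat{e}_k(t)] \mid \Theta(t)\} + \mathbb{E}\{\max_{k \in \mathcal{K}}[-\delta_k(t)] \mid \Theta(t)\}$$
$$= \mathbb{E}\{\min_{k \in \mathcal{K}}[e_k(t) + \delta_k(t)] \mid \Theta(t)\} + \mathbb{E}\{\max_{k \in \mathcal{K}}[-\delta_k(t)] \mid \Theta(t)\}$$
$$\leq \min_{k \in \mathcal{K}}[e_k(t)] + \sum_{k=1}^{K} \mathbb{E}\{\max[\delta_k(t), 0] + \max[-\delta_k(t), 0] \mid \Theta(t)\}$$
$$= \min_{k \in \mathcal{K}}[e_k(t)] + \sum_{k=1}^{K} \mathbb{E}\{|\delta_k(t)| \mid \Theta(t)\}$$

^{13}It can be shown that the infimum cost subject to strong stability is the same as the infimum cost subject to mean rate stability.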

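It follows that:

$$\Omega(\Theta(t)) \leq \sum_{k=1}^{K} \mathbb{E}\{|\delta_k(t)| \mid \Theta(t)\}$$

Therefore, by iterated expectations we have:

$$\mathbb{E}\{\Omega(\Theta(t))\} \leq \sum_{k=1}^{K} \mathbb{E}\{|\delta_k(t)|\} \tag{40}$$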

Note that $\mathbb{E}\{\Omega(\Theta(t))\}$ corresponds to the desired inequality (25), and hence it suffices to bound $\mathbb{E}\{|\delta_k(t)|\}$. To
this end, for each $k \in \{1, \ldots, K\}$, define $T_k(t)$ as the number of timeslots that passed after the $W$th-latest type-$k$
exploration event. Thus, all samples $\omega_w^{(k)}(t)$ occur on type-$k$ exploration events, and are on or after time $t - T_k(t)$.
Define $\Theta_k(t) \triangleq \Theta(t - T_k(t))$. We have:

$$|\delta_k(t)| = |\hat{e}_k(t) - e_k(t)| \leq |\hat{e}_k(t) - \hat{e}_k^{prev}(t)| + |\hat{e}_k^{prev}(t) - e_k^{prev}(t)| + |e_k^{prev}(t) - e_k(t)| \tag{41}$$

where $\hat{e}_k^{prev}(t)$ and $e_k^{prev}(t)$ are defined using queue lengths from the previous time $t - T_k(t)$ as follows:

$$\hat{e}_k^{prev}(t) \triangleq \frac{1}{W} \sum_{w=1}^{W} \min_{I \in \mathcal{I}} \left[ Y_k(I, \omega_w^{(k)}(t), \Theta(t - T_k(t))) \right]$$

$$e_k^{prev}(t) \triangleq \mathbb{E}\left\{ \min_{I \in \mathcal{I}} [Y_k(I, \omega(t), \Theta(t - T_k(t)))] \mid \Theta(t - T_k(t)), k(t) = k \right\}$$

where the expectation in the definition of $e_k^{prev}(t)$ is with respect to the independent outcome $\omega(t)$ that has
distribution $F_k(\omega)$. Comparing the definition of $e_k^{prev}(t)$ to the definition of $e_k(t)$ in (30), it is clear that they
are different only in that they use different queue states (similarly, $\hat{e}_k(t)$ and $\hat{e}_k^{prev}(t)$ differ only in that they use
different queue states). Because the maximum change in queue size on any single slot is bounded, we have the
following lemma.
Lemma 3: For any $k \in \{1, \ldots, K\}$, any time $t$, and regardless of queue backlog $\Theta(t)$, we have:

$$\mathbb{E}\{|\hat{e}_k(t) - \hat{e}_k^{prev}(t)| + |e_k^{prev}(t) - e_k(t)|\} \leq d_1 \mathbb{E}\{T_k(t)\} \tag{42}$$

where $d_1$ is a constant that is proportional to the maximum change in any queue over a single slot, and is independent
of the current queue sizes and of $W$ and $K$.
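Proof: Define $I_w^{(k)}(t)$ as follows:

$$I_w^{(k)}(t) \triangleq \arg\min_{I \in \mathcal{I}} Y_k(I, \omega_w^{(k)}(t), \Theta(t - T_k(t)))$$

We have:

$$\hat{e}_k(t) = \frac{1}{W} \sum_{w=1}^{W} \min_{I \in \mathcal{I}} \left[ Y_k(I, \omega_w^{(k)}(t), \Theta_w^{(k)}(t)) \right]$$
$$\leq \frac{1}{W} \sum_{w=1}^{W} Y_k(I_w^{(k)}(t), \omega_w^{(k)}(t), \Theta_w^{(k)}(t))$$
$$\leq \frac{1}{W} \sum_{w=1}^{W} Y_k(I_w^{(k)}(t), \omega_w^{(k)}(t), \Theta(t - T_k(t))) + c_1 T_k(t)$$
$$= \hat{e}_k^{prev}(t) + c_1 T_k(t)$$

where $c_1$ is a constant that is proportional to the maximum change of any queue value over one slot. With an almost
identical argument, it can be shown that $\hat{e}_k^{prev}(t) \leq \hat{e}_k(t) + c_2 T_k(t)$, where $c_2$ is a constant that is proportional to
the maximum change of any queue value over one slot. Thus:

$$|\hat{e}_k(t) - \hat{e}_k^{prev}(t)| \leq \max[c_1, c_2] T_k(t)$$

Therefore:

$$\mathbb{E}\{|\hat{e}_k(t) - \hat{e}_k^{prev}(t)|\} \leq \max[c_1, c_2] \mathbb{E}\{T_k(t)\}$$

Similarly, we can show:

$$\mathbb{E}\{|e_k(t) - e_k^{prev}(t)|\} \leq c_3 \mathbb{E}\{T_k(t)\}$$

Defining $d_1 \triangleq \max[c_1, c_2] + c_3$ proves the lemma. □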
It now suffices to bound $\mathbb{E}\{|\hat{e}_k^{prev}(t) - e_k^{prev}(t)|\}$. For a given $k \in \{1, \ldots, K\}$ and a given collection of queue
states $\Theta(t - T_k(t))$ at time $t - T_k(t)$, define the following function $Y(\omega)$:

$$Y(\omega) \triangleq \min_{I \in \mathcal{I}} [Y_k(I, \omega, \Theta(t - T_k(t)))] \tag{43}$$

Note that $\hat{e}_k^{prev}(t)$ is simply an empirical average of the function $Y(\omega)$ over $W$ i.i.d. samples $\omega_w^{(k)}(t)$ (which have
distribution $F_k(\omega)$). Note that these values are also independent of the queue state $\Theta(t - T_k(t))$, as these samples
are taken on or after time $t - T_k(t)$. Further, the value $e_k^{prev}(t)$ is simply an expected value of the random variable
$Y(\omega)$ over all outcomes $\omega$ that take place with distribution $F_k(\omega)$. Hence we have reduced the problem to a
pure "Law of Large Numbers" problem of bounding the expected difference between the exact mean of a random
variable and its empirical average over $W$ i.i.d. samples. Because the queue backlogs $\Theta(t - T_k(t))$ are considered
constant in $Y(\omega)$, we can write $Y(\omega)$ in terms of component random variables as follows (using (31)):

$$Y(\omega) = \min_{I \in \mathcal{I}} \left[ V l(x(k, \omega, I)) + \sum_{n=1}^{N} U_n(t - T_k(t)) h_n(x(k, \omega, I)) + \sum_{m \in \mathcal{M}} Z_m(t - T_k(t)) x_m(k, \omega, I) - \sum_{l=1}^{L} Q_l(t - T_k(t))[\mu_l(k, \omega, I) - a_l(k, \omega, I)] \right] \tag{44}$$
$$= V Y_V(\omega) + \sum_{n=1}^{N} U_n(t - T_k(t)) Y_{U,n}(\omega) + \sum_{m \in \mathcal{M}} Z_m(t - T_k(t)) Y_{Z,m}(\omega) - \sum_{l=1}^{L} Q_l(t - T_k(t)) Y_{Q,l}(\omega)$$

n=l meM '=1 



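where $Y_V(\omega)$, $Y_{U,n}(\omega)$, $Y_{Z,m}(\omega)$, $Y_{Q,l}(\omega)$ are random variables defined as (from (31)):

$$Y_V(\omega) \triangleq l(x(k, \omega, I^*)) \tag{45}$$
$$Y_{U,n}(\omega) \triangleq h_n(x(k, \omega, I^*)) \tag{46}$$
$$Y_{Z,m}(\omega) \triangleq x_m(k, \omega, I^*) \tag{47}$$
$$Y_{Q,l}(\omega) \triangleq \mu_l(k, \omega, I^*) - a_l(k, \omega, I^*) \tag{48}$$

where $I^*$ is the stage-2 control action that achieves the min in (44). Now define $\overline{Y}_V$, $\overline{Y}_{U,n}$, $\overline{Y}_{Z,m}$, $\overline{Y}_{Q,l}$ as
the expectations of the random variables in (45)-(48) over the random variable $\omega$ that has distribution $F_k(\omega)$,
and define $\overline{Y}_V^{(W)}$, $\overline{Y}_{U,n}^{(W)}$, $\overline{Y}_{Z,m}^{(W)}$, $\overline{Y}_{Q,l}^{(W)}$ as the corresponding empirical averages over the i.i.d. samples $\omega_w^{(k)}(t)$ (for
$w \in \{1, \ldots, W\}$). We thus have:

$$\hat{e}_k^{prev}(t) - e_k^{prev}(t) = V(\overline{Y}_V^{(W)} - \overline{Y}_V) + \sum_{n=1}^{N} U_n(t - T_k(t))(\overline{Y}_{U,n}^{(W)} - \overline{Y}_{U,n})$$
$$\qquad + \sum_{m \in \mathcal{M}} Z_m(t - T_k(t))(\overline{Y}_{Z,m}^{(W)} - \overline{Y}_{Z,m}) - \sum_{l=1}^{L} Q_l(t - T_k(t))(\overline{Y}_{Q,l}^{(W)} - \overline{Y}_{Q,l})$$

and hence:

$$|\hat{e}_k^{prev}(t) - e_k^{prev}(t)| \leq V|\overline{Y}_V^{(W)} - \overline{Y}_V| + \sum_{n=1}^{N} U_n(t - T_k(t))|\overline{Y}_{U,n}^{(W)} - \overline{Y}_{U,n}|$$
$$\qquad + \sum_{m \in \mathcal{M}} |Z_m(t - T_k(t))||\overline{Y}_{Z,m}^{(W)} - \overline{Y}_{Z,m}| + \sum_{l=1}^{L} Q_l(t - T_k(t))|\overline{Y}_{Q,l}^{(W)} - \overline{Y}_{Q,l}| \tag{49}$$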

We now use the following basic lemma concerning the expected difference between an empirical average and its
exact mean:

Lemma 4: Let $\{Y_w\}_{w=1}^{\infty}$ be an i.i.d. sequence of random variables with a general distribution with finite support,
so that there are finite constants $y_{min}$ and $y_{max}$ such that:

$$y_{min} \leq Y_w \leq y_{max} \quad \text{for all } w \in \{1, 2, \ldots\}$$

Define $y_{diff} \triangleq y_{max} - y_{min}$. Define $\overline{Y}$ as the expectation of $Y_1$, and define $\overline{Y}^{(W)}$ as the empirical average over $W$
samples: $\overline{Y}^{(W)} \triangleq \frac{1}{W} \sum_{w=1}^{W} Y_w$. Then:

$$\mathbb{E}\{|\overline{Y}^{(W)} - \overline{Y}|\} \leq \frac{y_{diff}}{2\sqrt{W}}$$

Proof: The proof is straightforward and is given in Appendix D for completeness. □
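Because all penalties and cost functions are upper and lower bounded, the random variables in (45)-(48) have
finite support, and we define $y_{diff}^{max}$ as the maximum difference in the maximum and minimum possible values over
all of the random variables. Using Lemma 4 in (49) yields:

$$\mathbb{E}\{|\hat{e}_k^{prev}(t) - e_k^{prev}(t)| \mid \Theta(t - T_k(t)), T_k(t)\} \leq \frac{y_{diff}^{max}}{2\sqrt{W}} \left[ V + \sum_{n=1}^{N} U_n(t - T_k(t)) + \sum_{m \in \mathcal{M}} |Z_m(t - T_k(t))| + \sum_{l=1}^{L} Q_l(t - T_k(t)) \right]$$
$$\leq \frac{y_{diff}^{max}}{2\sqrt{W}} \left[ V + \sum_{n=1}^{N} U_n(t) + \sum_{m \in \mathcal{M}} |Z_m(t)| + \sum_{l=1}^{L} Q_l(t) \right] + d_2 T_k(t)$$

where $d_2$ is a constant that depends on the maximum change in queue backlog on a given slot. Taking expectations
of the above and using the law of iterated expectations yields:

$$\mathbb{E}\{|\hat{e}_k^{prev}(t) - e_k^{prev}(t)|\} \leq \frac{y_{diff}^{max}}{2\sqrt{W}} \left[ V + \sum_{n=1}^{N} \mathbb{E}\{U_n(t)\} + \sum_{m \in \mathcal{M}} \mathbb{E}\{|Z_m(t)|\} + \sum_{l=1}^{L} \mathbb{E}\{Q_l(t)\} \right] + d_2 \mathbb{E}\{T_k(t)\}$$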

Using the above inequality with (42), (41) in (40) yields: 



E{0(©(t))} < 



Ty- max 
^Vdiff 



N 



K 



2^/W 



(50) 



n=l meM '=1 I fe=l 

where $c$ is a constant that depends on the maximum possible change in queue backlogs over one slot. The random variable $T_k(t)$ can be viewed as a sum of $W$ geometric random variables (each with mean $K/\theta$), with the possible exception when $t$ is small and some of the past $W$ samples occur during the initialization time $\tau \in \{-WK, -WK+1, \ldots, -1\}$. Therefore, for all $t$ and all $k$ we have:

$$\mathbb{E}\left\{T_k(t)\right\} \leq WK/\theta + WK$$
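The mean of $T_k(t)$ can be checked numerically. The sketch below is only an illustration under an assumed sampling model that is consistent with geometric inter-sample times of mean $K/\theta$: on each slot an exploration event occurs with probability `theta`, and the explored action is then drawn uniformly from the `K` actions. The function names and parameter values are hypothetical, and the initialization samples are ignored.

```python
import random

def time_to_collect(W, K, theta, rng):
    """Number of slots until W samples of one fixed action have been seen."""
    slots, samples = 0, 0
    while samples < W:
        slots += 1
        # Assumed model: explore w.p. theta, then pick the fixed action w.p. 1/K.
        if rng.random() < theta and rng.randrange(K) == 0:
            samples += 1
    return slots

def mean_collection_time(W, K, theta, trials=2000, seed=0):
    """Monte Carlo estimate of E{T_k(t)} under the assumed model."""
    rng = random.Random(seed)
    return sum(time_to_collect(W, K, theta, rng) for _ in range(trials)) / trials

W, K, theta = 20, 4, 0.5
estimate = mean_collection_time(W, K, theta)
# Compare the estimate with W*K/theta and with the looser bound W*K/theta + W*K.
print(estimate, W * K / theta, W * K / theta + W * K)
```

The estimate should sit near $WK/\theta$, and the extra $WK$ term in the bound covers the case where some samples come from the initialization period.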

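Then inequality (50) satisfies the condition (25) from Assumption 3 with:

$$\epsilon_V = \epsilon_U = \epsilon_Z = \epsilon_Q = \frac{K y_{diff}^{max}}{2\sqrt{W}}, \qquad C \triangleq c\left[WK^2/\theta + WK^2\right]$$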



This completes the proof of Theorem 3. 
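The two constants pull in opposite directions as the window size $W$ grows: the error coefficient shrinks like $1/\sqrt{W}$ while $C$ grows linearly in $W$. The following minimal sketch tabulates both. The helper name `theorem3_constants` and the numerical values of `K`, `theta`, `y_diff_max` and `c` are illustrative assumptions, not values from the paper.

```python
import math

def theorem3_constants(W, K, theta, y_diff_max, c):
    """Return (epsilon, C) as given by the closed forms at the end of the proof."""
    eps = K * y_diff_max / (2.0 * math.sqrt(W))
    C = c * (W * K**2 / theta + W * K**2)
    return eps, C

for W in (1, 4, 16, 64):
    eps, C = theorem3_constants(W, K=4, theta=0.5, y_diff_max=1.0, c=1.0)
    print(W, eps, C)
```

Quadrupling $W$ halves the error coefficient but quadruples $C$, which is the tradeoff between accuracy and the constant term in the drift bound.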



Appendix D — Proof of Lemma 4 



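Proof: We have:

$$\mathbb{E}\left\{\left|\bar{Y}^{(W)} - \bar{Y}\right|\right\}^2 \leq \mathbb{E}\left\{\left|\bar{Y}^{(W)} - \bar{Y}\right|^2\right\} = \frac{\sigma^2}{W}$$

where $\sigma^2$ is the variance of $Y_1$. The inequality is Jensen's inequality, and the equality holds because the samples are i.i.d. It suffices to bound $\sigma^2$ in terms of the constants $y_{min}$, $y_{max}$, and $y_{diff}$. We have:

$$
\begin{aligned}
\sigma^2 = Var(Y_1) &= Var(Y_1 - y_{min}) \\
&= \mathbb{E}\left\{(Y_1 - y_{min})^2\right\} - (\bar{Y} - y_{min})^2 \\
&\leq \mathbb{E}\left\{(y_{max} - y_{min})(Y_1 - y_{min})\right\} - (\bar{Y} - y_{min})^2 \qquad (51) \\
&= (\bar{Y} - y_{min})\left(y_{max} - y_{min} - (\bar{Y} - y_{min})\right) \\
&= (\bar{Y} - y_{min})(y_{max} - \bar{Y}) \qquad (52)
\end{aligned}
$$

where (51) holds because $0 \leq Y_1 - y_{min} \leq y_{max} - y_{min}$. To compute the final bound on the expression in (52), note that $y_{min} \leq \bar{Y} \leq y_{max}$, and the maximum of the function $f(x) = (x - y_{min})(y_{max} - x)$ over the interval $y_{min} \leq x \leq y_{max}$ is equal to $(y_{max} - y_{min})^2/4$. Thus, $\sigma^2 \leq y_{diff}^2/4$, and so $\mathbb{E}\{|\bar{Y}^{(W)} - \bar{Y}|\} \leq \sigma/\sqrt{W} \leq y_{diff}/(2\sqrt{W})$. □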
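
The bound of Lemma 4 can be sanity-checked by simulation. The sketch below is an illustration only: it assumes Bernoulli samples on $\{0, 1\}$ (so $y_{diff} = 1$), and with $p = 1/2$ the variance attains the worst case $y_{diff}^2/4$. The function names, sample distribution and trial counts are hypothetical choices.

```python
import math
import random

def mean_abs_deviation(W, p=0.5, trials=5000, seed=1):
    """Monte Carlo estimate of E|Ybar^(W) - Ybar| for Bernoulli(p) samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        avg = sum(1 if rng.random() < p else 0 for _ in range(W)) / W
        total += abs(avg - p)
    return total / trials

def lemma4_bound(W, y_min=0.0, y_max=1.0):
    """Right-hand side of Lemma 4: y_diff / (2 sqrt(W))."""
    return (y_max - y_min) / (2.0 * math.sqrt(W))

for W in (1, 4, 16, 64):
    print(W, mean_abs_deviation(W), lemma4_bound(W))
```

For $W = 1$ and $p = 1/2$ the deviation equals $1/2$ on every sample, so the bound holds with equality in that case. For larger $W$ the estimate should fall below the bound.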

Appendix E — Proof of Theorem 4 

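The proof of Theorem 3 can be followed in the same way, with the exception that the fixed value $W$ is replaced by the random value $W(t)$ (which may be correlated with queue states). Therefore, repeating the proof in Appendix C, the result of (50) translates to:

$$\mathbb{E}\left\{\phi(\Theta(t))\right\} \leq \frac{K y_{diff}^{max}}{2}\,\mathbb{E}\left\{\frac{V + \sum_n U_n(t) + \sum_m \left|Z_m(t)\right| + \sum_l Q_l(t)}{\sqrt{W(t)}}\right\} + c\sum_{k=1}^{K} \mathbb{E}\left\{T_k(t)\right\}$$

Each term $\mathbb{E}\{T_k(t)\}$ can be bounded by $\tilde{W}(t)K/\theta + W_0 K$. The final term can thus be bounded as follows:

$$c\sum_{k=1}^{K} \mathbb{E}\left\{T_k(t)\right\} \leq c\left[\tilde{W}(t)K^2/\theta + W_0 K^2\right]$$

where $\tilde{W}(t) \triangleq (t + 1)^{\beta_1}$. Define $C_1(t) \triangleq c\left[\tilde{W}(t)K^2/\theta + W_0 K^2\right]$.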
It is not difficult to show that $W_{rand}(t)$ satisfies:

$$\lim_{t \to \infty} \frac{W_{rand}(t)}{t} = \frac{\theta}{K} \quad \text{with probability 1}$$
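
This limit can be illustrated with a short simulation. The sketch assumes the same sampling model as before (an exploration event with probability `theta` per slot, then a uniformly chosen action among `K`), takes $W_{rand}(t)$ to be the smallest per-action sample count, and ignores the initialization samples. The function name and parameter values are hypothetical.

```python
import random

def w_rand_over_t(t, K, theta, seed=0):
    """Simulate t slots and return (min over actions of sample count) / t."""
    rng = random.Random(seed)
    counts = [0] * K
    for _ in range(t):
        # Assumed model: explore w.p. theta, then sample a uniform action.
        if rng.random() < theta:
            counts[rng.randrange(K)] += 1
    return min(counts) / t

K, theta = 4, 0.5
for t in (10**3, 10**4, 10**5):
    # The ratio should approach theta / K as t grows.
    print(t, w_rand_over_t(t, K, theta), theta / K)
```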

However, $\tilde{W}(t)$ increases sub-linearly with $t$. Therefore, because $W(t) = \min[\tilde{W}(t), W_{rand}(t)]$, we have:

$$\lim_{t \to \infty} Pr[W(t) \neq \tilde{W}(t)] = 0$$

Furthermore, because $W_{rand}(t)$ is simply the min of $K$ delayed renewal processes $W_1(t), \ldots, W_K(t)$ (each having i.i.d. geometric inter-arrival times with mean $K/\theta$), we have by the union bound:

$$Pr[W(t) \neq \tilde{W}(t)] = Pr\left[\min[W_1(t), \ldots, W_K(t)] < (t + 1)^{\beta_1}\right] \leq K\,Pr\left[W_1(t) < (t + 1)^{\beta_1}\right]$$
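
A rough way to see how fast this tail probability decays is a Chernoff bound. The sketch below is an assumption-laden illustration, not the paper's argument: it models a single count as Binomial$(t, \theta/K)$ (ignoring the initialization samples), uses the standard relative-entropy Chernoff bound for the lower tail, and takes a hypothetical sub-linear exponent `beta`. It then evaluates $t$ times $K$ times the bound.

```python
import math

def kl_bernoulli(q, p):
    """Relative entropy D(q || p) between Bernoulli(q) and Bernoulli(p)."""
    total = 0.0
    if q > 0:
        total += q * math.log(q / p)
    if q < 1:
        total += (1 - q) * math.log((1 - q) / (1 - p))
    return total

def scaled_tail_bound(t, K, theta, beta):
    """Chernoff upper bound on t * K * Pr[Binomial(t, theta/K) <= (t+1)**beta]."""
    p = theta / K
    q = (t + 1) ** beta / t
    if q >= p:
        # The Chernoff bound is vacuous until the threshold falls below the mean.
        return float(t * K)
    return t * K * math.exp(-t * kl_bernoulli(q, p))

for t in (10**2, 10**3, 10**4, 10**5):
    print(t, scaled_tail_bound(t, K=4, theta=0.5, beta=0.5))
```

Under these assumptions the bound decays exponentially in $t$, so multiplying by $t$ still gives a quantity that vanishes.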



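It follows that:

$$\lim_{t \to \infty} t\,Pr[W(t) \neq \tilde{W}(t)] = 0$$

Therefore:

$$
\begin{aligned}
\frac{K y_{diff}^{max}}{2}\,\mathbb{E}\left\{\frac{V + \sum_n U_n(t) + \sum_m \left|Z_m(t)\right| + \sum_l Q_l(t)}{\sqrt{W(t)}}\right\} \leq{}& \frac{K y_{diff}^{max}}{2\sqrt{\tilde{W}(t)}}\,\mathbb{E}\left\{V + \sum_n U_n(t) + \sum_m \left|Z_m(t)\right| + \sum_l Q_l(t) \,\Big|\, W(t) = \tilde{W}(t)\right\} Pr[W(t) = \tilde{W}(t)] \\
& + \frac{K y_{diff}^{max}}{2}\,\mathbb{E}\left\{V + \sum_n U_n(t) + \sum_m \left|Z_m(t)\right| + \sum_l Q_l(t) \,\Big|\, W(t) \neq \tilde{W}(t)\right\} Pr[W(t) \neq \tilde{W}(t)]
\end{aligned}
$$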

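where we have used the fact that $W(t) \geq 1$ always. Adding the (non-negative) conditional expectation to complete the first term on the right hand side yields:

$$
\begin{aligned}
\frac{K y_{diff}^{max}}{2}\,\mathbb{E}\left\{\frac{V + \sum_n U_n(t) + \sum_m \left|Z_m(t)\right| + \sum_l Q_l(t)}{\sqrt{W(t)}}\right\} \leq{}& \frac{K y_{diff}^{max}}{2\sqrt{\tilde{W}(t)}}\,\mathbb{E}\left\{V + \sum_n U_n(t) + \sum_m \left|Z_m(t)\right| + \sum_l Q_l(t)\right\} \\
& + \frac{K y_{diff}^{max}}{2}\,\mathbb{E}\left\{V + \sum_n U_n(t) + \sum_m \left|Z_m(t)\right| + \sum_l Q_l(t) \,\Big|\, W(t) \neq \tilde{W}(t)\right\} Pr[W(t) \neq \tilde{W}(t)] \\
\leq{}& \frac{K y_{diff}^{max}}{2\sqrt{\tilde{W}(t)}}\,\mathbb{E}\left\{V + \sum_n U_n(t) + \sum_m \left|Z_m(t)\right| + \sum_l Q_l(t)\right\} + \frac{K y_{diff}^{max} c_0 t}{2}\,Pr[W(t) \neq \tilde{W}(t)]
\end{aligned}
$$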



where $c_0$ is a constant that is proportional to the maximum change in any queue over one slot. Because $t\,Pr[W(t) \neq \tilde{W}(t)] \to 0$ as $t \to \infty$, there exists a time $t_0$ such that for all $t \geq t_0$ we have:

$$\frac{K y_{diff}^{max} c_0 t}{2}\,Pr[W(t) \neq \tilde{W}(t)] \leq 1$$

We can now define $C(t) \triangleq C_1(t) + 1$ for use in Theorem 2 (note that $C(t) \leq O((t - t_0 + 1)^{\beta_1})$). Further define:

$$\epsilon_x(t) \triangleq \frac{K y_{diff}^{max}}{2\sqrt{\tilde{W}(t)}}$$

for $x \in \{V, U, Z, Q\}$. This satisfies the assumptions of Theorem 2, proving the result.